
As the 2022 Philippine elections draw near, the question on the minds of many citizens is how Filipinos will select the country's next set of leaders. Recent news has been filled with turmoil over controversial political alliances, fake news, and questionable political machinery, so much so that Filipinos have accused each other of being blindly loyal to a particular party without leaving room for discourse.
This study aims to give a better understanding of the preferences that drive the selection of candidates by Filipino voters. Using Frequent Itemset Mining, the group analyzed publicly available COMELEC datasets on the 2016 election results and the corresponding voter profiles. The frequent itemsets of candidates at the city/municipality level were then extracted to determine whether Filipinos tend to vote only for candidates belonging to a single party.
After conducting the analysis, the key insight uncovered was that Filipinos generally do not vote straight. This indicates the openness of most Filipinos to vote for candidates from a varied set of parties regardless of their political alliances, and reflects the importance Filipinos place on an individual candidate's merits.
The findings of this study may be used to reduce animosity amongst supporters of competing political parties and encourage political discourse, leading to a more holistic approach to voter education.
Research on voter preference has become an increasingly prominent topic, given the current state of the country, where the candidates running for office are becoming less and less representative of citizens. The Philippines has long been plagued by issues of underrepresentation, made more complicated by the country's perplexing fractional party list system (Fermin, 2001; Jallorina, 2021; Teehankee, 2020).
The party list system in the Philippines has been described as a system of patronage, where political parties can be “split, merged, regurgitated and repackaged”, resulting in a multitude of groups with varying types of members (Teehankee, 2020). In the current political climate, political parties have become less associated with their ideologies and policies and more with their political mechanisms and strong voter base. Given this complex structure, it is necessary to examine how voters choose their elected officials: whether on the basis of a candidate’s party membership or on a candidate’s individual merits and qualifications.
In this study, frequently occurring sets of candidates are uncovered across different regions using publicly available data from the 2016 Philippine elections. This aims to shed light on whether a given region tends to "vote straight", that is, to vote for sets of candidates from the same party. We conducted an exploratory data analysis using official government information from the Philippine Commission on Elections (COMELEC), most of which was gathered using a public Python scraper (Alis, 2016).
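The idea behind frequent itemset mining can be sketched on toy data. The transactions, candidate labels, party prefixes, and `min_support` threshold below are all hypothetical, and the brute-force counter is only a stand-in for illustration; it is not the `fim` package used in this study, nor a description of how the study builds its transactions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: one set of candidates per area.
# Each label encodes an invented party prefix (A or B) and a number.
transactions = [
    {"A1", "A2", "B1"},
    {"A1", "A2"},
    {"A1", "B1"},
    {"A1", "A2", "B2"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count itemsets of up to `max_size` items that appear in at least
    `min_support` transactions (brute-force enumeration, toy scale only)."""
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


def is_straight(itemset):
    """True when every candidate in the itemset shares a party prefix."""
    return len({label[0] for label in itemset}) == 1


frequent = frequent_itemsets(transactions, min_support=2)
pairs = {k: v for k, v in frequent.items() if len(k) == 2}
for itemset, support in sorted(pairs.items()):
    label = "same party" if is_straight(itemset) else "mixed parties"
    print(itemset, support, label)
```

In this toy setting, a frequent pair drawn from two different prefixes is the kind of pattern that would count as evidence against straight voting.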
Voter behavior is as much a factor in the success of democracy in the Philippines as the very candidates running for election. Throughout our history, the citizens of the country have been exposed to a number of distressing electoral events such as vote-buying, fraud and even dictatorship, to name a few. In light of the upcoming 2022 elections, it would therefore be worth studying the patterns and changes in voters’ behavior, to better understand how our citizens make decisions on electing our country’s leaders and shaping the economic future of the Philippines.
The main objective of this study is to answer this question: Do voters tend to vote straight from the same party in elections? And if so, which areas in the Philippines exhibit this tendency?
More specifically, this study aims to answer the following research questions:
Voter motivation: This study aims to describe the voting behavior among different regions in the Philippines. This is important because it allows us to see whether certain regions are loyal to their favorite parties, whether endorsement by a party head improves a candidate’s election results in that area, or whether that region is more concerned with the individual qualifications of a given candidate than with the party they are associated with.
Democratic elections: The results may also shed light on the question of whether democracy is truly being expressed in Philippine elections. Regions where voters are susceptible to vote buying may also show a tendency to vote straight during elections, due to party members buying votes for the entire party along with themselves.
Voter education: With the results of this study, regions that exhibit a tendency to vote straight may be further studied to determine whether a lack of voter education contributes to this behavior. It is possible that voters in these regions do not have access to reliable information on their candidates, so they make a sweeping decision and choose on the basis of a party head’s endorsement instead of information on individual candidates.
Three publicly available election-related datasets were used in this study. Table 1 below summarizes the description, type, and sources of these datasets. The raw 2016 election results and the profiles of the registered voters were then merged using the City/Municipality feature (see Table 2). From these datasets, the researchers extracted the variables used in this study, which are described in Table 2. Details of the data preprocessing are described in the next section.
| No. | Dataset | Description | Type | Source |
|---|---|---|---|---|
| 1 | 2016 Raw election results (“Election results”) | Vote counting machine (VCM) election results per electoral precinct | JSON (nested JSON files) | Extracted JSON files using the scraper of Prof. Christian Alis. Ultimate source is COMELEC |
| 2 | 2016 Voter profile by provinces and cities or municipalities (“Voter profile”) | Profiles of the registered voters per city/municipality | CSV | COMELEC |
| 3 | Database of Global Administrative Areas Philippine Map | A database of shapefiles containing location boundaries of countries up to the barangay level | DB (stored in a PostgreSQL database) | Database stored on Jojie for the Geospatial Analysis (GSA) class. Ultimate source is GADM |
| No. | Variables | Variable name/s | Data type | Description | Dataset |
|---|---|---|---|---|---|
| 1 | Candidate name and position to run | bName, Position | STR | Name of the candidates, the Position variable differentiates the candidates for President, Vice President and Senator | Election results |
| 2 | Number of votes per candidate | votes | INT | Number of votes per candidate. In the exploratory data analysis (EDA), the votes were normalized per region by calculating the share (%) of votes of a particular candidate relative to the total for the city/municipality, region, or any other grouping of interest | Election results |
| 3 | Number of ballots canvassed | num_voted | INT | Number of registered voters who cast their vote on election day | Election results |
| 4 | Location | Region, Province, City/Municipality | STR | This serves as the variable that links the Voter profile and Election results dataset (the “merging key”) | Election results and voter profile |
| 5 | Number of registered voters | registered_voter | INT | Number of registered voters per city and municipality in the 2016 national elections | Voter profile |
| 6 | Sex | male, female | STR | Biological sex of registered voters | Voter profile |
| 7 | Age groups | 17-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-above | | Age groups of registered voters | Voter profile |
| 8 | Literacy rate | literacy | FLOAT | The percentage of the population 10 years old and over, who can read, write and understand simple messages in any language or dialect (PSA) | Voter profile |
| 9 | Indigenous people | indigenous_people | INT | A group of people or homogenous societies identified by self-ascription and ascription by others, who have continuously lived as organized community (PSA) | Voter profile |
| 10 | Person with disability | person_with_disability | INT | Persons with disabilities including those who have long-term physical, mental, intellectual and other forms of impairments | Voter profile |
| 11 | Marital status | single, married, widow | STR | Legally defined marital state | Voter profile |
| 12 | Shapefile (city/municipality level) | geom | GEOM | Shapefiles for each city/municipality, to be used for plotting maps | GADM |
After merging the election results and the voter profile information, the resulting DataFrame had 1649 rows × 25 columns. The rows of the dataset correspond to the Region, District, and City/Municipality for each area in which the Philippine elections were held. Columns refer to the election outcome per presidential, vice-presidential and senatorial candidate (number of votes) and voter profile (e.g. sex, age, literacy rate, among others).
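The vote-share normalization described in Table 2 can be sketched as follows. This is a minimal illustration on invented numbers; the area and candidate names are hypothetical, and only the column names `City/Municipality`, `bName`, and `votes` come from the datasets described above. The `share_pct` column name is also an assumption.

```python
import pandas as pd

# Hypothetical vote counts; column names follow Table 2.
votes = pd.DataFrame({
    "City/Municipality": ["CITY X", "CITY X", "CITY Y", "CITY Y"],
    "bName": ["CANDIDATE 1", "CANDIDATE 2", "CANDIDATE 1", "CANDIDATE 2"],
    "votes": [600, 400, 150, 350],
})

# Share (%) of each candidate's votes relative to the area total.
# transform("sum") broadcasts each group's total back onto its rows.
area_total = votes.groupby("City/Municipality")["votes"].transform("sum")
votes["share_pct"] = 100 * votes["votes"] / area_total
print(votes)
```

Grouping by `Region` or any other location column instead of `City/Municipality` would give the share at that level.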
The methodology for this study is summarized in the following nine (9) steps, which will be discussed in the succeeding subsections.
Because the study aims to identify voter preferences among different regions, the datasets were merged following the schema Region, then District, then City/Municipality. We noted, however, that some City/Municipality values were not consistent across the two datasets. For example, the value “Bacoor City” was used in the Election results data, but the Voter profile referred to it as “Bacoor”. We also observed that a number of City/Municipality names recur under different Region or District values, so the name alone does not always identify a location. Finally, the datasets also differed in how the Region is named. For example, the province of Negros Occidental was assigned to a region named NIR (Negros Island Region) in the election results but was listed as part of Region VI in both the voter profile and the GADM datasets. Thus, cleaning the datasets made up a significant portion of our workflow. For more information on the data cleaning, kindly refer to Section 5.3: Cleaning the data.
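One way to surface such mismatches before merging is an outer merge with pandas' `indicator=True` option, sketched below. The two miniature DataFrames are hypothetical stand-ins for the Election results and Voter profile tables; this is an illustration of the general technique, not necessarily the procedure the team followed.

```python
import pandas as pd

KEYS = ["Region", "District", "City/Municipality"]

# Hypothetical miniature versions of the two datasets.
results = pd.DataFrame({
    "Region": ["REGION IV-A", "REGION IV-A"],
    "District": ["CAVITE", "CAVITE"],
    "City/Municipality": ["BACOOR CITY", "KAWIT"],
})
profile = pd.DataFrame({
    "Region": ["REGION IV-A", "REGION IV-A"],
    "District": ["CAVITE", "CAVITE"],
    "City/Municipality": ["BACOOR", "KAWIT"],
})

# An outer merge with indicator=True labels each key combination as
# 'both', 'left_only' (results only), or 'right_only' (profile only).
check = results.merge(profile, on=KEYS, how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["City/Municipality", "_merge"]])
```

Rows flagged `left_only` or `right_only` are the candidates for manual correction.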
Figure 2 below shows the structure of the Election results data, which are stored in nested JSON files. As illustrated in the figure, the raw electoral data is stored in the innermost nest of the JSON file path. We thus needed to extract the data from 271,921 JSON files corresponding to 90,641 unique electoral precincts times three positions considered.
import os
os.environ['MPLCONFIGDIR'] = os.getcwd() + "/configs/"
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.options.display.max_colwidth = 200
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import plotly.graph_objects as go
import plotly.express as px
from sklearn.cluster import Birch, KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import calinski_harabasz_score, silhouette_score
import geopandas as gpd
import seaborn as sns
import re
import fim
import json
import glob
import psycopg2
import pickle
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
To extract the Election results data, we first stored the JSON file paths in a CSV file. Next, a for loop was used to open each file and extract the relevant variables. The resulting DataFrame was then loaded into the variable file_df. Extracting the JSON file paths took 45 minutes.
# takes 45 minutes to run
def extract_filepaths():
    """Create a DataFrame from the filepaths of the JSON files.

    Collect the filepaths of all JSON files under the
    `/mnt/data/public/elections/nle2016/PHILIPPINES/` folder, split
    each path into its location components, add a column `path`
    containing the whole filepath, and save the result to
    `election_filepaths.csv`.

    This function takes about 45 minutes to run on Jojie.

    Returns
    -------
    pandas DataFrame
        columns : `Region`, `District`, `City/Municipality`,
        `Barangay`, `Precinct`, `File`, `path`
    """
    path = ('/mnt/data/public/elections/'
            + 'nle2016/PHILIPPINES/**/*PHILIPPINES.json')
    file_list = []
    for file in tqdm(glob.iglob(path, recursive=True)):
        file_list.append(file)
    # split every filepath into its folder components
    file_comp = []
    for i in tqdm(file_list):
        comp = os.path.normpath(i).split(os.sep)
        file_comp.append(comp)
    # the first seven components are the fixed path prefix and are dropped
    election_results_df = (pd.DataFrame(file_comp,
                                        columns=[0, 1, 2, 3, 4, 5, 6,
                                                 'Region', 'District',
                                                 'City/Municipality',
                                                 'Barangay',
                                                 'Precinct',
                                                 'File'])
                           .drop(columns=[0, 1, 2, 3, 4, 5, 6]))
    election_results_df['path'] = file_list
    election_results_df.to_csv('election_filepaths.csv', index=None)
    return election_results_df
extract_filepaths()
def file_names(df):
    """Return a list of file names for the election data to be extracted."""
    regions = list(df['Region'].unique())
    posfiles = list(df['File'].unique()[1:])
    positions = ['pres', 'vp', 'senator']
    csv_list = []
    for pos in positions:
        for reg in regions:
            x = reg.lower().split(' ')
            if len(x) == 2:
                x = pos + '_' + x[0] + x[1] + '.csv'
            else:
                x = pos + '_' + str(x[0]) + '.csv'
            csv_list.append(x)
    df = pd.DataFrame({'reg': regions*3, 'poscsv': csv_list})
    # rows 0-18: president, rows 19-37: vice president, rows 38-56: senator
    # (.loc slices are inclusive, so 19:37 covers the same rows as [19:38])
    df['posfile'] = posfiles[0]
    df.loc[19:37, 'posfile'] = posfiles[2]
    df.loc[38:, 'posfile'] = posfiles[1]
    df['position'] = positions[0]
    df.loc[19:37, 'position'] = positions[1]
    df.loc[38:, 'position'] = positions[2]
    return df.apply(tuple, axis=1).tolist()
file_df = pd.read_csv('election_filepaths.csv')
parameters = file_names(file_df)
With all filepaths of the election results available, the code below opens each JSON file and extracts the variables relevant to the study, such as the name of the candidate, the number of votes, the precinct location, and the voter turnout in the area. The 2016 election results for the positions of President, Vice President, and Senator were saved as CSV files per region into three different directories named pres, vp, and senator, respectively. This entire process took two hours and forty minutes to complete.
# takes two hours and forty minutes to run
def generate_csv_files(region, poscsv, posfile, position):
    """Create `csv` file from `json` file per position.

    Read `election_filepaths.csv` and retrieve rows pertaining to
    specified position and region. This creates csv files that correspond
    to the number of votes garnered by each political candidate that
    ran for president, vice president, and senator in the 2016 elections.

    This block of code takes two hours and forty minutes to run on Jojie.
    """
    pos_df = (file_df[(file_df['File'] == posfile) &
                      (file_df['Region'] == region)])
    pos_conso = pd.DataFrame(columns=['bName', 'votes', 'percentage',
                                      'canCode', 'Region', 'District',
                                      'City/Municipality', 'Barangay',
                                      'Precinct', 'num_voted', 'reg_voters'])
    for i in tqdm(pos_df.index):
        path = os.path.normpath(pos_df['path'][i])
        # open json file and save relevant content as CSV
        with open(path, 'r') as f:
            data = json.load(f)
        pos = pd.DataFrame.from_dict(data['results'])
        pos['cClass'] = data['cClass']
        pos['Position'] = position
        pos['Region'] = pos_df['Region'][i]
        pos['District'] = pos_df['District'][i]
        pos['City/Municipality'] = pos_df['City/Municipality'][i]
        pos['Barangay'] = pos_df['Barangay'][i]
        pos['Precinct'] = pos_df['Precinct'][i]
        pos['num_voted'] = \
            data['stats']['regionInfo']['rows'][-1]['value']
        pos['reg_voters'] = \
            data['stats']['regionInfo']['rows'][-2]['value']
        pos_conso = pd.concat([pos_conso, pos])
    # save CSV files in a directory
    pos_conso.to_csv(position + '/' + poscsv, index=None)
# implement data extraction
for region, poscsv, posfile, position in parameters:
    generate_csv_files(region, poscsv, posfile, position)
    print(f'{region} {position} file done!')
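The per-file extraction above can be illustrated on a small in-memory example. The payload below is hypothetical: it mirrors only the keys that the extraction code reads (`results`, `cClass`, and `stats` → `regionInfo` → `rows`), while the `name` fields and all values are invented, and the actual files may contain more fields.

```python
import json

import pandas as pd

# Hypothetical payload mirroring only the keys read by the code above.
raw = json.loads("""
{
  "cClass": "NATIONAL",
  "results": [
    {"bName": "CANDIDATE 1", "votes": 120},
    {"bName": "CANDIDATE 2", "votes": 80}
  ],
  "stats": {"regionInfo": {"rows": [
    {"name": "registered", "value": "300"},
    {"name": "voted", "value": "200"}
  ]}}
}
""")

rows = raw["stats"]["regionInfo"]["rows"]
flat = pd.DataFrame(raw["results"])     # one row per candidate
flat["reg_voters"] = rows[-2]["value"]  # second-to-last row
flat["num_voted"] = rows[-1]["value"]   # last row
print(flat)
```

Note that the precinct-level totals are repeated on every candidate row, which matters for any later aggregation of `num_voted` or `reg_voters`.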
The datasets extracted per region are then stacked together per position (president, vice president, and senator) in the subsequent code.
# takes one minute and 30 seconds to run
pres_paths = [parameters[i][3] + '/' + parameters[i][1] for i in range(19)]
vp_paths = [parameters[i][3] + '/' + parameters[i][1] for i in range(19, 38)]
senator_paths = [parameters[i][3] + '/' + parameters[i][1]
for i in range(38, 57)]
def combine(paths):
    """Stack all the per-region datasets extracted into one DataFrame."""
    # read every per-region CSV, then concatenate once
    return pd.concat([pd.read_csv(path) for path in paths], axis=0)
df_pres = combine(pres_paths)
df_vp = combine(vp_paths)
df_senator = combine(senator_paths)
To provide further motivation for the study, the accompanying dataset on the demographics of the 2016 National Election voters is also explored.
df_profile = pd.read_csv('/mnt/data/public/elections/comelec/voters_profile/'
'philippine_2016_voter_profile_by_provinces_and'
'_cities_or_municipalities_including_districts.csv')
df_profile = df_profile.rename(columns=
{'region':'Region',
'province':'District',
'city_or_municipality_including_districts':
'City/Municipality'})
def location_cleaner(df, col):
    """Match the formatting of location data from election results."""
    df[col] = df[col].str.upper()
    df[col] = df[col].str.strip()
    return df


def trimmer(row):
    """Collapse repeated spaces between words into a single space."""
    if len(row.split()) > 1:
        return re.sub(r' +', ' ', row)
    else:
        return row


def N_tilde(row):
    """Replace Ñ with N for easier processing."""
    if 'Ñ' in row:
        return row.replace('Ñ', 'N')
    else:
        return row
df_profile = location_cleaner(df_profile, 'Region')
df_profile = location_cleaner(df_profile, 'District')
df_profile = location_cleaner(df_profile, 'City/Municipality')
df_profile['literacy'] = df_profile['literacy'].str.rstrip('%').astype(float)
dfs = [df_profile, df_pres, df_vp, df_senator]
for df in dfs:
    # collapse extraneous whitespace inside city/municipality names
    df['City/Municipality'] = df['City/Municipality'].apply(trimmer)
    # replace N-tilde with a plain N
    df['City/Municipality'] = df['City/Municipality'].apply(N_tilde)
The two datasets, Election results (JSON file) and Voters profile (CSV file), needed to be properly merged using the location schema: Region, District, then City/Municipality. This meant the values of this key needed to match between datasets. However, despite both datasets ultimately coming from COMELEC, we observed that the merging variable City/Municipality had a number of inconsistencies, as noted in Section 5.1 Examining the three main datasets. These were mainly mismatches in the labelling, spelling, and tagging of cities/municipalities between the voter profile and election results data.
The considerations in cleaning the data and their respective remedies are summarized in Table 3 below.
| No. | Data considerations | Remedies |
|---|---|---|
| 1 | Some cities/municipalities in the JSON file have multiple spaces between words (e.g. DAPITAN CITY) whereas the CSV file only has one. The Election results also label the cities in uppercase whereas the CSV file uses title case. | We created and applied functions that handle these nuances in the data. The functions do the following: (1) split the City/Municipality on spaces (if there are any) and re-join the parts with a single space, which addresses the cities with multiple spaces in between; (2) convert all of the City/Municipality data into uppercase. |
| 2 | The character Ñ is properly read in the Election results (JSON file), but not in the Voters profile (CSV file). | For the purposes of matching and the analysis, we converted all of the “Ñ” and “ñ” into “N” and “n”. Example: The N-tilde in PEÑABLANCA was removed, and has now become PENABLANCA. |
| 3 | There are differences in the tagging of cities with respect to their districts/partitions within the city. Examples (select only) are as follows: CSV file: “City of Makati, 1st District; City of Makati, 2nd District” vs JSON file: “CITY OF MAKATI” CSV file: “Island Garden City of Samal, Babak District; Island Garden City of Samal, Kaputian District; Island Garden City of Samal, Samal District” vs JSON file: “ISLAND GARDEN CITY OF SAMAL” | The cities that have partitions into different districts (e.g. “City of Makati, 1st District; City of Makati, 2nd District”) are labelled using the name of their ‘parent city’ (e.g. “CITY OF MAKATI”). This applies to other cities with the same case as well. |
| 4 | There are differences in the spelling of city labels (e.g. TBOLI vs T’BOLI; Mlang vs M’LANG) | We were able to filter which cities in the CSV file do not have a corresponding match in JSON (or vice versa). We manually handle these differences in the spelling to match each other. |
| 5 | The JSON file has considered “Negros Island Region” (NIR) as a separate region altogether. However in the CSV file, Negros Occidental and Negros Oriental are included in Region VI and Region VII, respectively. | We conformed to the tagging of the CSV wherein Negros Occidental and Negros Oriental should be appended to the dataframes for Region VI and Region VII, respectively. |
| 6 | There were unmatched cities even after applying the remedies in Data considerations nos. 1 to 5. Examples: ‘Banga’ and ‘Esperanza’ in Region 12 (CSV) did not have matching cities in the JSON; ‘Maconacon’ in Region 2 (CSV) did not have any matching city in the JSON. | After all of the automated and manual handling of the data, if cities in the CSV remain unmatched to the JSON, we drop those cities from the CSV file. This is further handled in Section 5.4 Merging and storing to a consolidated dataframe as well. |
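
The automated remedies in Table 3 (nos. 1 to 3) can be combined into a single normalization step, sketched below. The function name `normalize_city` and the input strings are hypothetical; the sketch only illustrates the combined effect of uppercasing, collapsing spaces, replacing Ñ, and keeping the parent city of a district-tagged label.

```python
import re


def normalize_city(name):
    """Uppercase, collapse repeated spaces, replace Ñ with N, and keep
    only the parent city when a ', <district>' suffix is present."""
    name = name.upper().strip()
    name = re.sub(r" +", " ", name)
    name = name.replace("Ñ", "N")
    return name.split(",")[0]


examples = ["Dapitan   City", "Peñablanca", "City of Makati, 1st District"]
print([normalize_city(x) for x in examples])
```

Uppercasing first means that a lowercase ñ is also handled by the single `Ñ` replacement.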
There are also instances in which manual replacement was undertaken to match the election data and voter profile datasets, such as Isabela City and Cotabato City, which were tagged as Special Provinces in the Region column.
def city_matcher(df, orig_name, replace_name, col='City/Municipality'):
    """Replace names of cities to match information to JSON files."""
    df[col] = df[col].replace({orig_name: replace_name})
    return df
# enumerate all the non-matching pairs between voter demographic data and
# election results, then match them following election results
correction_pairs = [
('POZZORUBIO', 'POZORRUBIO'),
('ILAGAN','ILAGAN CITY'),
('BACOOR','BACOOR CITY'),
('IMUS', 'IMUS CITY'),
('BINAN CITY', 'CITY OF BINAN'),
('CABUYAO CITY', 'CABUYAO'),
('BROOKES POINT', 'BROOKE\'S POINT'),
('LIGAO CITY', 'CITY OF LIGAO'),
('CEBU CITY, 1ST DISTRICT', 'CEBU CITY'),
('CEBU CITY, 2ND DISTRICT', 'CEBU CITY'),
('ZAMBOANGA CITY, 1ST DISTRICT', 'ZAMBOANGA CITY'),
('ZAMBOANGA CITY, 2ND DISTRICT', 'ZAMBOANGA CITY'),
('DAVAO CITY, 1ST DISTRICT', 'DAVAO CITY'),
('DAVAO CITY, 2ND DISTRICT', 'DAVAO CITY'),
('DAVAO CITY, 3RD DISTRICT', 'DAVAO CITY'),
('ISLAND GARDEN CITY OF SAMAL, BABAK DISTRICT',
'ISLAND GARDEN CITY OF SAMAL'),
('ISLAND GARDEN CITY OF SAMAL, KAPUTIAN DISTRICT',
'ISLAND GARDEN CITY OF SAMAL'),
('ISLAND GARDEN CITY OF SAMAL, SAMAL DISTRICT',
'ISLAND GARDEN CITY OF SAMAL'),
('BANISILAN', 'KABACAN'),
('MLANG', "M'LANG"),
('President Roxas', 'KABACAN'),
("TBOLI", "T`BOLI")
]
for pair in correction_pairs:
    df_profile = city_matcher(df_profile, pair[0], pair[1])
# exception cases, such as changes in districts and special provinces
df_profile = city_matcher(df_profile, 'CARAGA', 'REGION XIII', col='Region')
df_profile['City/Municipality'] = (df_profile['City/Municipality']
.apply(lambda x: x.split(',')[0]))
df_profile = city_matcher(df_profile, 'TAGUIG CITY', 'TAGUIG')
df_profile = city_matcher(df_profile, 'FOURTH DISTRICT',
'NATIONAL CAPITAL REGION - FOURTH DISTRICT',
col='District')
df_profile = city_matcher(df_profile, 'MANILA',
'NATIONAL CAPITAL REGION - MANILA', col='District')
df_profile = city_matcher(df_profile, 'SECOND DISTRICT',
'NATIONAL CAPITAL REGION - SECOND DISTRICT',
col='District')
df_profile = city_matcher(df_profile, 'THIRD DISTRICT',
'NATIONAL CAPITAL REGION - THIRD DISTRICT',
col='District')
df_profile = city_matcher(df_profile, 'DAVAO (DAVAO DEL NORTE)',
'DAVAO (DAVAO DEL NORTE)', col='District')
isabela = (df_profile
[df_profile['City/Municipality'] == 'ISABELA CITY']
.replace({'Region':{'SPECIAL PROVINCES':'ARMM'},
'District':{'SPECIAL PROVINCES':'BASILAN'}}))
cotabato = (df_profile
[df_profile['City/Municipality'] == 'COTABATO CITY']
.replace({'Region':{'SPECIAL PROVINCES':'ARMM'},
'District':{'SPECIAL PROVINCES':'MAGUINDANAO'}}))
df_profile = pd.concat([df_profile[2:], isabela, cotabato])
# takes one minute to run
# match district name of Taguig-Pateros to voter profile data
df_pres = city_matcher(df_pres, 'TAGUIG - PATEROS',
'NATIONAL CAPITAL REGION - FOURTH DISTRICT',
col='District')
df_vp = city_matcher(df_vp, 'TAGUIG - PATEROS',
'NATIONAL CAPITAL REGION - FOURTH DISTRICT',
col='District')
df_senator = city_matcher(df_senator, 'TAGUIG - PATEROS',
'NATIONAL CAPITAL REGION - FOURTH DISTRICT',
col='District')
# convert extracted turnout numbers in numeric form
for df in [df_pres, df_vp, df_senator]:
    # strip thousands separators before casting to integer
    df['num_voted'] = (df['num_voted'].astype(str)
                       .str.replace(',', '').astype(int))
    df['reg_voters'] = (df['reg_voters'].astype(str)
                        .str.replace(',', '').astype(int))
def region_correct(region, district):
    """Assign NIR districts to their respective region numbers."""
    if region == 'NIR' and district == 'NEGROS OCCIDENTAL':
        return 'REGION VI'
    elif region == 'NIR' and district == 'NEGROS ORIENTAL':
        return 'REGION VII'
    else:
        return region
# encode correct region of Negros Occidental and Negros Oriental
df_pres['Region'] = df_pres.apply(lambda x: region_correct(x['Region'],
x['District']),
axis=1)
df_vp['Region'] = df_vp.apply(lambda x: region_correct(x['Region'],
x['District']), axis=1)
df_senator['Region'] = df_senator.apply(lambda x:
region_correct(x['Region'],
x['District']), axis=1)
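The same reassignment can also be written without a row-wise `apply`, which may be faster on large tables. The sketch below is an alternative under stated assumptions, not the code used in the study: the small DataFrame and the `nir_to_region` mapping name are hypothetical, while the region and district labels follow the function above.

```python
import pandas as pd

# Hypothetical rows using the same labels as the election results.
df = pd.DataFrame({
    "Region": ["NIR", "NIR", "REGION VI"],
    "District": ["NEGROS OCCIDENTAL", "NEGROS ORIENTAL", "ILOILO"],
})

nir_to_region = {"NEGROS OCCIDENTAL": "REGION VI",
                 "NEGROS ORIENTAL": "REGION VII"}

is_nir = df["Region"] == "NIR"
# Map only the NIR rows; keep the original region if a district is unmapped.
df.loc[is_nir, "Region"] = (df.loc[is_nir, "District"]
                            .map(nir_to_region)
                            .fillna(df.loc[is_nir, "Region"]))
print(df)
```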
To merge the election results and the voter profile information with the shapefiles provided by GADM, the geospatial dataset had to be cleaned so that its Region, District, and City/Municipality values conform to the format of the election results. For this, the same functions were used: uppercasing the location strings, trimming extraneous whitespace, and converting Ñ to N.
turnout = (df_pres.groupby(['Region', 'District',
'City/Municipality'])[['num_voted',
'reg_voters']].sum())
turnout = turnout.reset_index().drop('Region', axis=1)
# load the GADM database from GSA, publicly accessible
conn = psycopg2.connect(dbname="postgis",
user="gsa2022",
password="g5!V%T1Vmd",
host="192.168.212.99",
port=32771)
ph_shp = gpd.read_postgis("""
SELECT *
FROM gadm.ph
""", con=conn, geom_col='geom')
# pre-preprocessing of shapefile
ph_shp = ph_shp[['name_1', 'name_2','geom']]
ph_shp = ph_shp.rename(columns={'name_1': 'District',
'name_2': 'City/Municipality'})
for col in ['District', 'City/Municipality']:
    ph_shp[col] = ph_shp[col].str.upper()
ph_shp['City/Municipality'] = ph_shp['City/Municipality'].apply(trimmer)
ph_shp['City/Municipality'] = ph_shp['City/Municipality'].apply(N_tilde)
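The `turnout` table built at the start of this step can be turned into a turnout rate, as sketched below on invented numbers. One caveat, assuming the precinct totals are repeated on every candidate row as the extraction code suggests: summing candidate-level rows counts each precinct once per candidate, which leaves the ratio unchanged but inflates the raw totals. The sketch therefore keeps one row per precinct first. The `turnout_pct` column name and all values are hypothetical, and real data would need the full location key to identify a precinct.

```python
import pandas as pd

# Hypothetical candidate-level rows: precinct totals repeat per candidate.
df = pd.DataFrame({
    "City/Municipality": ["CITY X"] * 4,
    "Precinct": ["0001A", "0001A", "0002A", "0002A"],
    "bName": ["CANDIDATE 1", "CANDIDATE 2"] * 2,
    "num_voted": [80, 80, 90, 90],
    "reg_voters": [100, 100, 100, 100],
})

# Keep one row per precinct so totals are counted once, then aggregate.
per_precinct = df.drop_duplicates(["City/Municipality", "Precinct"])
turnout = (per_precinct
           .groupby("City/Municipality")[["num_voted", "reg_voters"]]
           .sum())
turnout["turnout_pct"] = 100 * turnout["num_voted"] / turnout["reg_voters"]
print(turnout)
```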
Even with these preprocessing steps, we noticed many discrepancies in how the City/Municipality column of the GADM dataset was encoded relative to the election data. We therefore identified the cities/municipalities that did not match the election results and recorded them in a pickle file, so that they could be replaced with names that conform to the convention used in the election results.
This replacement also altered the values of similarly named cities/municipalities, which then had to be corrected by taking the district or province into account.
with open('ph_shp_replace.pkl', 'rb') as file:
    to_replace = pickle.load(file)
for pair in to_replace:
    ph_shp = city_matcher(ph_shp, pair[0], pair[1])
triples = [
('ALBAY', 'SANTO DOMINGO', 'SANTO DOMINGO (LIBOG)'),
('BOHOL', 'VALENCIA (LUZURRIAGA)', 'VALENCIA'),
('COMPOSTELA VALLEY', 'MABINI', 'MABINI (DONA ALICIA)'),
('EASTERN SAMAR', 'SALCEDO (BAUGEN)', 'SALCEDO'),
('ISABELA', 'QUIRINO (ANGKAKI)', 'QUIRINO'),
('LA UNION', 'SAN FERNANDO CITY', 'CITY OF SAN FERNANDO'),
('MISAMIS ORIENTAL', 'MAGSAYSAY', 'MAGSAYSAY (LINUGOS)'),
('QUEZON', 'SAN ANDRES (CALOLBON)', 'SAN ANDRES'),
('QUEZON', 'SAN FRANCISCO', 'SAN FRANCISCO (AURORA)'),
('ROMBLON', 'SAN ANDRES (CALOLBON)', 'SAN ANDRES'),
('ROMBLON', 'SANTA MARIA', 'SANTA MARIA (IMELDA)'),
('SOUTH COTABATO', 'SANTO NINO (FAIRE)', 'SANTO NINO'),
('SURIGAO DEL NORTE', 'SAN FRANCISCO', 'SAN FRANCISCO (ANAO-AON)')
]
for g in triples:
    # rename only within the given district, then write the rows back
    filtered = city_matcher(ph_shp[ph_shp['District'] == g[0]], g[1], g[2])
    ph_shp.iloc[filtered.index, :] = filtered
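A district-scoped rename like the one above can also be expressed with a boolean mask and `.loc`, which avoids modifying a filtered copy and does not depend on the index being a default integer range. This is a sketch of an alternative, not the study's code: the three-row DataFrame `shp` is hypothetical, while the two triples are taken from the list above.

```python
import pandas as pd

# Hypothetical miniature shapefile attribute table.
shp = pd.DataFrame({
    "District": ["QUEZON", "SURIGAO DEL NORTE", "ROMBLON"],
    "City/Municipality": ["SAN FRANCISCO", "SAN FRANCISCO", "SANTA MARIA"],
})

triples = [
    ("QUEZON", "SAN FRANCISCO", "SAN FRANCISCO (AURORA)"),
    ("SURIGAO DEL NORTE", "SAN FRANCISCO", "SAN FRANCISCO (ANAO-AON)"),
]

for district, old, new in triples:
    # Select rows matching both the district and the old name.
    mask = (shp["District"] == district) & (shp["City/Municipality"] == old)
    shp.loc[mask, "City/Municipality"] = new
print(shp)
```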
Other discrepancies were found in the naming of the Region and District variables between the election results and the voter profile dataset. Some of these involved cities/municipalities in Davao Occidental, Davao del Norte, North Cotabato, Western Samar, and Metro Manila, which had to be changed manually.
# align Davao Occidental with election returns
filtered = ph_shp[(ph_shp['District'] == 'DAVAO DEL SUR') &
ph_shp['City/Municipality'].isin(['DON MARCELINO',
'JOSE ABAD SANTOS',
'MALITA', 'SANTA MARIA',
'SARANGANI'])]
filtered = filtered.replace({'District': {'DAVAO DEL SUR': 'DAVAO OCCIDENTAL'},
'City/Municipality':
{'JOSE ABAD SANTOS':
'JOSE ABAD SANTOS (TRINIDAD)'}})
ph_shp.iloc[filtered.index, :] = filtered
# align Davao del Norte with election returns
filtered = ph_shp[(ph_shp['District'] == 'DAVAO DEL NORTE')]
filtered = filtered.replace({'District': {'DAVAO DEL NORTE':
'DAVAO (DAVAO DEL NORTE)'},
'City/Municipality':
{'SAMAL CITY':
'ISLAND GARDEN CITY OF SAMAL',
'ASUNCION': 'ASUNCION (SAUG)',
'TAGUM CITY': 'CITY OF TAGUM'}})
ph_shp.iloc[filtered.index, :] = filtered
# align North Cotabato with election returns
filtered = ph_shp[(ph_shp['District'] == 'NORTH COTABATO')]
filtered = filtered.replace({'District': {'NORTH COTABATO':
'COTABATO (NORTH COT.)'},
'City/Municipality':
{'KIDAPAWAN CITY':
'CITY OF KIDAPAWAN'}})
ph_shp.iloc[filtered.index, :] = filtered
# align Western Samar with election returns
filtered = ph_shp[(ph_shp['District'] == 'SAMAR')]
filtered = filtered.replace({'District': {'SAMAR':
'SAMAR (WESTERN SAMAR)'},
'City/Municipality':
{'PARANAS':
'PARANAS (WRIGHT)',
'SANTO NINO (FAIRE)': 'SANTO NINO'}})
ph_shp.iloc[filtered.index, :] = filtered
# align Metro Manila - Fourth District with election returns
filtered = (ph_shp[(ph_shp['District'] == 'METROPOLITAN MANILA') &
ph_shp['City/Municipality'].isin(['LAS PINAS',
'MAKATI CITY',
'MUNTINLUPA', 'PARANAQUE',
'PASAY CITY', 'PATEROS',
'TAGUIG'])]
.replace({'District': {'METROPOLITAN MANILA':
'NATIONAL CAPITAL REGION - FOURTH DISTRICT'},
'City/Municipality': {'LAS PINAS': 'CITY OF LAS PINAS',
'MAKATI CITY': 'CITY OF MAKATI',
'MUNTINLUPA': 'CITY OF MUNTINLUPA',
'PARANAQUE': 'CITY OF PARANAQUE'}
}))
ph_shp.iloc[filtered.index, :] = filtered
# align Metro Manila - Second District with election returns
filtered = (ph_shp[(ph_shp['District'] == 'METROPOLITAN MANILA') &
ph_shp['City/Municipality'].isin(['MANDALUYONG',
'QUEZON CITY',
'MARIKINA', 'PASIG CITY',
'SAN JUAN'])]
.replace({'District': {'METROPOLITAN MANILA':
'NATIONAL CAPITAL REGION - SECOND DISTRICT'},
'City/Municipality': {'MANDALUYONG':
'CITY OF MANDALUYONG',
'MARIKINA': 'CITY OF MARIKINA',
'PASIG CITY': 'CITY OF PASIG',
'SAN JUAN': 'SAN JUAN CITY'}
}))
ph_shp.iloc[filtered.index, :] = filtered
# align Metro Manila - Third District with election returns
filtered = (ph_shp[(ph_shp['District'] == 'METROPOLITAN MANILA') &
ph_shp['City/Municipality'].isin(['VALENZUELA',
'NAVOTAS',
'KALOOKAN CITY',
'MALABON'])]
.replace({'District': {'METROPOLITAN MANILA':
'NATIONAL CAPITAL REGION - THIRD DISTRICT'},
'City/Municipality': {'KALOOKAN CITY':
'CALOOCAN CITY',
'VALENZUELA': 'CITY OF VALENZUELA',
'MALABON': 'MALABON CITY',
'NAVOTAS': 'NAVOTAS CITY'}
}))
ph_shp.iloc[filtered.index, :] = filtered
# manual replacement of similarly named city/municipality across districts
filtered = (ph_shp[(ph_shp['District'] == 'ILOCOS SUR') &
(ph_shp['City/Municipality'] == 'SAN JUAN')]
.replace({'City/Municipality': {'SAN JUAN': 'SAN JUAN (LAPOG)'}}))
ph_shp.iloc[filtered.index, :] = filtered
filtered = (ph_shp[(ph_shp['District'] == 'SOUTHERN LEYTE') &
(ph_shp['City/Municipality'] == 'SAN JUAN')]
.replace({'City/Municipality': {'SAN JUAN':
'SAN JUAN (CABALIAN)'}}))
ph_shp.iloc[filtered.index, :] = filtered
filtered = (ph_shp[(ph_shp['District'] == 'KALINGA') &
(ph_shp['City/Municipality'] == 'RIZAL')]
.replace({'City/Municipality': {'RIZAL':
'RIZAL (LIWAN)'}}))
ph_shp.iloc[filtered.index, :] = filtered
filtered = (ph_shp[(ph_shp['District'] == 'PALAWAN') &
(ph_shp['City/Municipality'] == 'RIZAL')]
.replace({'City/Municipality': {'RIZAL':
'RIZAL (MARCOS)'}}))
ph_shp.iloc[filtered.index, :] = filtered
Doing all of these preprocessing steps allowed us to conform the GADM dataset with the elections database. Upon merging both datasets following the Region, District, City/Municipality schema, 1647 areas with shapefiles were deemed to have election results. For both instances of merging, information on the election results of the following locations had to be dropped:
districts and the shapefile is not partitioned per district. The election results are not reported per district, but are presented in terms of specific locations such as Binondo, Ermita, Intramuros, etc. The aforementioned information is only taken into account in the calculation of the 2016 election voter turnout and is dropped from the subsequent analysis.
Before performing data mining, the team first explored the datasets by examining the demographics of the 2016 National Election voters along with the corresponding voter turnout per region. Geospatial visualizations were used to identify where each presidential and vice-presidential candidate won. From this, the team uncovered the demographics of the regions where notable candidates for the presidency and the vice-presidency won. This was followed by exploring the number of parties and the number of candidates that ran in the National Elections for each party.
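As a sanity check on the merge described above, it can help to list which location keys appear in only one of the two sources before dropping anything. The following is a minimal, dependency-free sketch; the helper name `unmatched_keys` and the two toy key lists are hypothetical and only illustrate the idea with (District, City/Municipality) pairs.

```python
def unmatched_keys(shapefile_keys, election_keys):
    """Return the keys found in only one of the two sources, sorted."""
    shp, elec = set(shapefile_keys), set(election_keys)
    return sorted(shp - elec), sorted(elec - shp)


# Hypothetical (District, City/Municipality) pairs, for illustration only
shapefile_keys = [('SAMAR', 'PARANAS'),
                  ('KALINGA', 'RIZAL (LIWAN)')]
election_keys = [('SAMAR (WESTERN SAMAR)', 'PARANAS (WRIGHT)'),
                 ('KALINGA', 'RIZAL (LIWAN)')]

only_in_shapefile, only_in_elections = unmatched_keys(shapefile_keys,
                                                      election_keys)
print(only_in_shapefile)   # keys that still need renaming in the shapefile
print(only_in_elections)   # election areas without a matching shape
```

Keys that show up in both lists under slightly different spellings are the ones that need a manual replacement rule like those applied earlier.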
We first explore the demographics of voters across different regions.
def plot_region_profile_age(region_name, fignum):
    """Plots percentage distribution of registered voters per age group
    in a given region.
    """
    # connecting to data
    dfp = pd.read_csv('/mnt/data/public/elections/comelec/voters_profile/'
                      'philippine_2016_voter_profile_by_provinces_and_cities'
                      '_or_municipalities_including_districts.csv')
    # preparing table
    profiling = dfp.drop('registered_voter', axis=1)
    age_vars = ['17-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49',
                '50-54', '55-59', '60-64', '65-above']
    profiling = profiling.drop_duplicates(keep='first')
    profiling = profiling.groupby(['region'])[age_vars].sum()
    profiling['total'] = profiling.sum(axis=1)
    # convert counts to percentages of the regional total
    for var in age_vars:
        profiling[var] = (profiling[var] / profiling['total']) * 100
    profiling = profiling.reset_index('region')
    region_data = profiling[profiling['region'] == region_name]
    region_data = pd.melt(region_data, id_vars='region', value_vars=age_vars,
                          var_name='age_group', value_name='percent')
    # plot the graph
    plt.figure(figsize=(15, 4))
    for var in age_vars:
        plt.bar(region_data[region_data['age_group'] == var]['age_group'],
                region_data[region_data['age_group'] == var]['percent'],
                color='purple')
    plt.title(f'Figure {fignum}: Percentage distribution of registered voters'
              f' per age group in {region_name}\n', fontsize=14)
    plt.ylim(0, 20)
    plt.ylabel('Percentage of Registered Voters\n')
    plt.xlabel('\nAge Group')
    return None
plot_region_profile_age('NCR', 3)
For NCR, voters generally came from younger age groups, led by the 20-24 age group and followed by 25-29. Among the younger voters, however, the 17-19 age group had a much lower contribution at 5.83%. As the age groups become older, they account for a smaller percentage of total voters. An exception is the 65-above age group, which contributed more than the 60-64 age group, but this may be due to its larger bin size.
plot_region_profile_age('Region IX', 4)
In the Zamboanga Peninsula, voters also generally came from younger age groups, with the 20-24 age group accounting for the largest percentage of voters at 15.19%. This was higher than the percentage of this age group for NCR. The 65-above age group contribution was also higher than its NCR counterpart, at 7.69%. Another notable observation is that for this region, the 17-19 age group had the lowest contribution at 4.28%.
plot_region_profile_age('Region X', 5)
For Northern Mindanao, the contribution per age group to the total registered voters exhibited a similar distribution to that of the Zamboanga Peninsula. The age group with the largest contribution was the 20-24 age group at 14.76%, which was also higher than NCR. The 65-above age group also had a larger contribution compared to NCR, at 7.92%. The 17-19 age group however, had a much lower contribution than NCR and other regions in Mindanao, at 3.88%.
plot_region_profile_age('Region XI', 6)
For the Davao Region, the age group distribution was also similar to other regions. Like other regions in Mindanao, the 20-24 age group had a higher contribution than its NCR counterpart, at 15.06%. Additionally, the contribution of the 65-above age group was lower than that of NCR, at 5.44%. An interesting observation, however, is that the 17-19 age group had an 8.14% contribution, much higher than in NCR and the Mindanao regions discussed above. This may be attributed to the fact that Rodrigo Duterte, the former mayor of Davao City, was running for president.
plot_region_profile_age('Region XII', 7)
For the SOCCSKSARGEN region, the distribution was similar to other Mindanao regions, except it was the 25-29 age group that showed the greatest contribution to total voters, at 14.87%. The 17-19 age group showed a low contribution at 3.15%, similar to Northern Mindanao. Additionally, the 65-above age group had a similar contribution as NCR, with 6.53%.
plot_region_profile_age('ARMM', 8)
The Autonomous Region in Muslim Mindanao (ARMM) had the most distinct percentage distribution per age group. Like other regions previously discussed, voters in this region tended to be of younger age groups, but the percentage contribution of these younger age groups was much higher than in other regions. The 20-24 age group contributed 21.89% of the total voters, and even the 17-19 age group was much higher, at 10.37%.
def plot_pct_feature(feature, fignum):
    """Plots percentage distribution of registered voters per region for
    a given feature.
    """
    # connecting to data
    dfp = pd.read_csv('/mnt/data/public/elections/comelec/voters_profile/'
                      'philippine_2016_voter_profile_by_provinces_and_cities'
                      '_or_municipalities_including_districts.csv')
    df_plot_tag = 0
    if feature == 'age':
        # get relevant columns and compute for percentages
        dfp['17-29'] = dfp['17-19'] + dfp['20-24'] + dfp['25-29']
        dfp['30-49'] = (dfp['30-34'] + dfp['35-39'] + dfp['40-44']
                        + dfp['45-49'])
        dfp['50-64'] = dfp['50-54'] + dfp['55-59'] + dfp['60-64']
        dfp['65+'] = dfp['65-above']
        age = dfp.groupby(['region'], as_index=True).agg({'17-29': 'sum',
                                                          '30-49': 'sum',
                                                          '50-64': 'sum',
                                                          '65+': 'sum'})
        df_plot = age.div(age.sum(1), axis=0)
        df_plot_tag = 1
    elif feature == 'sex':
        sex = dfp.groupby(['region'], as_index=True).agg({'male': 'sum',
                                                          'female': 'sum'})
        df_plot = sex.div(sex.sum(1), axis=0)
        df_plot_tag = 1
    elif feature == 'civil status':
        status = (dfp.groupby(['region'], as_index=True)
                  .agg({'single': 'sum',
                        'married': 'sum',
                        'widow': 'sum',
                        'legally_seperated': 'sum'}))
        df_plot = status.div(status.sum(1), axis=0)
        df_plot_tag = 1
    # plot the graph
    if df_plot_tag == 1:
        df_plot.plot.barh(legend=True, stacked=True, figsize=(15, 8))
        plt.ylabel('Region\n')
        plt.xlabel('\nPercentage of registered voters')
        plt.title(f'Figure {fignum}: '
                  f'Percentage of registered voters by {feature}\n',
                  fontsize=14)
    elif feature == 'literacy':
        dfp['literacy'] = (dfp['literacy'].str
                           .rstrip('%')).astype(float)
        df_plot = (dfp.groupby(['region'], as_index=True)
                   .agg({'literacy': 'mean'}))
        # plot the graph
        df_plot.plot.barh(legend=True, stacked=False, figsize=(15, 8))
        plt.ylabel('Region\n')
        plt.xlabel('\nMean literacy rate')
        plt.title(f'Figure {fignum}: Mean literacy rate per region\n',
                  fontsize=14)
    return None
plot_pct_feature('age', 9)
Overall, we can see similarities across regions in the Philippines when it comes to the age distribution of voters in a region. Combining different age groups into bins of 17-29, 30-49, 50-64, and 65+, we can see that the 17-29 and 30-49 age groups comprise most of the voters across regions. Additionally, the share of registered voters decreases as age increases. It is also worth noting that the Autonomous Region in Muslim Mindanao (ARMM) differs from other regions in its percentage of younger voters, which is much higher than elsewhere.
plot_pct_feature('sex', 10)
In terms of the sex of registered voters, regions across the Philippines are observed to have almost equal distribution when it comes to male and female registered voters. An interesting observation however is that NCR was observed to have the highest contribution of female voters among all regions. Additionally, Regions XII and VIII are shown to have the lowest contribution of female voters.
plot_pct_feature('civil status', 11)
We can see in the plot above that most registered voters are single or married, while widowed and legally separated voters make up a much smaller percentage. Additionally, NCR is the region with the highest share of registered voters that are single, while Region II and ARMM have the lowest.
plot_pct_feature('literacy', 12)
Finally, inspecting the mean literacy rate per region reveals that most regions have a literacy rate approaching 100%. However, ARMM differs greatly from the rest of the regions in the Philippines because this region only has a 51.02% literacy rate. It is also interesting to note that the regions with literacy rates of 99% and above are all from Luzon.
# calculate voter turnout
turnout = (df_pres.groupby(['Region'])[['num_voted', 'reg_voters']].sum())
turnout['voter_turnout'] = (turnout['num_voted'] / turnout['reg_voters']) * 100
# plot number of registered voters and voter turnout per region
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))
ax = ax.flatten()
ax[0].bar(height=turnout.sort_values(by='reg_voters',
ascending=False)['reg_voters'],
x=turnout.sort_values(by='reg_voters', ascending=False).index,
color='purple')
ax[0].tick_params(axis='x', labelrotation=90)
ax[0].set_title('Figure 13: Number of Registered Voters in 2016 per Region '
'(in millions)\n\n',
fontsize=14)
ax[0].spines['top'].set_visible(False)
ax[0].spines['right'].set_visible(False)
ax[1].bar(height=turnout.sort_values(by='voter_turnout',
ascending=False)['voter_turnout'][:-1],
x=turnout.sort_values(by='voter_turnout',
ascending=False).index[:-1],
color='purple')
ax[1].axhline(turnout['voter_turnout'].mean(), color='green', linestyle='--')
ax[1].set_title('Figure 14: Regional Voter Turnout in the 2016 Elections '
'(in percent)\n\n',
fontsize=14)
ax[1].tick_params(axis='x', labelrotation=90)
ax[1].set_ylim(70, 86)
ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
plt.show()
Taking a closer look at the number of registered voters per region, we can see that the Calabarzon region had the highest number, followed by NCR and Central Luzon. However, the regions with the highest voter turnout were not these regions but the Eastern Visayas, Ilocos, and Caraga regions. NCR and the Calabarzon region actually had a below-average turnout, along with the Zamboanga Peninsula. Overseas absentee voting likewise had a very low voter turnout of 36%.
turnout = (df_pres.groupby(['Region', 'District',
'City/Municipality'])[['num_voted',
'reg_voters']].sum())
turnout = turnout.reset_index().drop('Region', axis=1)
turnout_map = pd.merge(turnout, ph_shp, on=['District', 'City/Municipality'],
how='right')
turnout_map = gpd.GeoDataFrame(turnout_map, geometry='geom')
turnout_map['voter_turnout'] = (turnout_map['num_voted'] /
turnout_map['reg_voters']) * 100
turnout_map.plot(column='voter_turnout', figsize=(15, 10), cmap='spring_r',
legend=True, scheme='UserDefined',
classification_kwds=dict(bins=[70,75,80,85,90,95]), k=7)
plt.title('Figure 15: '
'Voter Turnout per City/Municipality in the 2016 Elections\n\n',
fontsize=14)
plt.axis('off')
plt.show()
We can see through this visualization that cities surrounding the Ilocos region had a very high voter turnout. Cities in the Bicol region also had a high voter turnout along with its surrounding areas. For Mindanao, the areas of Davao Oriental and other eastern cities, along with northern cities and the Sulu Archipelago had a very high voter turnout.
def pivot_generator(df, groupby_levels, index_levels, column_level, values):
    """Generate a pivot table to convert voter data into user-item matrix."""
    res = (pd.DataFrame(df.groupby(groupby_levels)[values].sum()
                        .reset_index()))
    res = res.pivot(index=index_levels, columns=column_level, values=values)
    return res
all_levels = ['bName', 'Region', 'District', 'City/Municipality']
index = all_levels[1:]
column = all_levels[0]
values = 'votes'
res_pres = pivot_generator(df_pres, all_levels, index, column, values)
res_vp = pivot_generator(df_vp, all_levels, index, column, values)
res_senator = pivot_generator(df_senator, all_levels, index, column, values)
pres_winners = pd.DataFrame(res_pres.idxmax(axis=1), columns=['winning_pres'])
vp_winners = pd.DataFrame(res_vp.idxmax(axis=1), columns=['winning_vp'])
pres_vp_winners = pd.merge(pres_winners, vp_winners,
left_index=True, right_index=True)
winner_map = pd.merge(turnout_map,
pres_vp_winners.reset_index().drop('Region', axis=1),
on=['District', 'City/Municipality'], how='left')
cmap_pres = ListedColormap(['darkorange', 'red',
'maroon', 'lightblue', 'gold'])
cmap_vp = ListedColormap(['red', 'lightblue', 'darkorange', 'maroon',
'gold'])
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 15))
ax = ax.flatten()
winner_map.plot(ax=ax[0], cmap=cmap_pres, column='winning_pres', legend=True,
categorical=True, legend_kwds={'loc': 'upper right'})
ax[0].set_title('Figure 16: Winning Presidential Candidate'
' per City/Municipality\n\n', fontsize=14)
ax[0].axis('off')
winner_map.plot(ax=ax[1], cmap=cmap_vp, column='winning_vp', legend=True,
categorical=True, legend_kwds={'loc': 'upper right'})
ax[1].set_title('Figure 17: Winning Vice Presidential Candidate'
' per City/Municipality\n\n', fontsize=14)
ax[1].axis('off')
plt.show()
If we take a look at where in the Philippines each presidential candidate had their stronghold of voters, we can see, as expected, that President Duterte's winning cities were largely in Mindanao. Mar Roxas of LP won in cities concentrated in Visayas, while Binay and Poe split Luzon. For vice presidential candidates, Leni Robredo won in an overwhelming number of cities, spread mainly across Visayas but also covering more than half of Mindanao and parts of Luzon. Bongbong Marcos, however, had a very large concentration of winning cities in Luzon. Mindanao seemed to have been split among Duterte's running mate Cayetano, Leni Robredo, and Bongbong Marcos.
In this section, we explore the characteristics of the cities/municipalities in which Rodrigo Duterte, Mar Roxas, and Grace Poe won.
df_profile_pres = pd.merge(df_profile, pres_winners.reset_index(), how='left')
df_profile_pres = df_profile_pres.dropna()
def candidate_profile(df, var, candidate, col, lowy, highy, fignum):
    """Return profile summary of voter demographic for each candidate."""
    age_groups = ['17-19', '20-24', '25-29', '30-34', '35-39', '40-44',
                  '45-49', '50-54', '55-59', '60-64', '65-above']
    status_groups = ['single', 'married', 'widow', 'legally_seperated']
    gender_groups = ['male', 'female']
    # keep only the cities/municipalities where the candidate won
    df_profile = df[df[var] == candidate]
    fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20, 8))
    ax = ax.flatten()
    ax[0].bar(height=df_profile[age_groups].sum() / df[age_groups].sum(),
              x=df_profile[age_groups].sum().index,
              color=col)
    ax[0].set_ylim(lowy, highy)
    ax[0].set_title(f'Figure {fignum}: Percentage of People who Voted '
                    f'\nfor {candidate},\n per Age Group\n\n', fontsize=14)
    ax[0].tick_params(axis='x', labelrotation=90)
    ax[0].spines['top'].set_visible(False)
    ax[0].spines['right'].set_visible(False)
    ax[1].bar(height=df_profile[status_groups].sum() / df[status_groups].sum(),
              x=df_profile[status_groups].sum().index, color=col)
    ax[1].set_title(f'Figure {fignum+1}: Percentage of People who Voted '
                    f'\nfor {candidate},\n per Civil Status\n\n', fontsize=14)
    ax[1].tick_params(axis='x', labelrotation=90)
    ax[1].spines['top'].set_visible(False)
    ax[1].spines['right'].set_visible(False)
    ax[2].bar(height=df_profile[gender_groups].sum() / df[gender_groups].sum(),
              x=df_profile[gender_groups].sum().index, color=col)
    ax[2].set_title(f'Figure {fignum+2}: Percentage of People who Voted '
                    f'\nfor {candidate},\n per Sex\n\n', fontsize=14)
    ax[2].tick_params(axis='x', labelrotation=90)
    ax[2].set_ylim(lowy, highy)
    ax[2].spines['top'].set_visible(False)
    ax[2].spines['right'].set_visible(False)
candidate_profile(df_profile_pres, 'winning_pres',
'DUTERTE, RODY (PDPLBN)', 'maroon', lowy=0.4, highy=0.6,
fignum=18)
The visualizations above show the profile of voters that Rodrigo Duterte appealed to when he was running for president. In terms of age groups, we can see that more than 50% of every age group chose Duterte as their president. Among all 30-34 year olds in the Philippines that cast their votes, around 57% voted for Duterte. Looking at civil status, we see that more than 65% of legally separated Filipinos voted for Duterte, possibly because of relatability. More than half of the single, married, and widowed Filipinos who voted also chose Duterte as their president. Duterte was also the popular choice among both male and female voters. Note that these percentages are computed from the registered voters of the cities/municipalities where Duterte won, so they approximate rather than directly measure individual votes.
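To make the quantity behind these bar charts concrete, the sketch below reproduces the calculation on toy numbers: for each demographic group, it divides the registered voters living in areas won by a candidate by the registered voters of that group across all areas. The helper name `share_in_won_areas` and the toy rows are hypothetical; only the ratio mirrors what `candidate_profile` plots.

```python
def share_in_won_areas(rows, groups, candidate):
    """Share of each group's registered voters living where `candidate` won."""
    shares = {}
    for group in groups:
        total = sum(row[group] for row in rows)
        in_won_areas = sum(row[group] for row in rows
                           if row['winner'] == candidate)
        shares[group] = in_won_areas / total
    return shares


# Toy city-level rows (hypothetical counts of registered voters)
rows = [
    {'winner': 'A', 'male': 600, 'female': 400},
    {'winner': 'B', 'male': 200, 'female': 300},
    {'winner': 'A', 'male': 200, 'female': 300},
]

shares = share_in_won_areas(rows, ['male', 'female'], 'A')
print(shares)
```

Because the numerator counts everyone registered in an area the candidate won, including those who voted for someone else, the result is best read as an area-level proxy.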
candidate_profile(df_profile_pres, 'winning_pres',
'ROXAS, MAR DAANG MATUWID (LP)', 'gold',
lowy=0.1, highy=0.3, fignum=21)
For Mar Roxas, he appealed to around 23% of the voters belonging to the 17-19 age group. He also appealed to around 22% of the 65-above voters. However, only around 18-23% of voters of all ages chose Mar Roxas as their president. He appealed to more than 20% of single and widowed voters, as well as to around 19-20% of male and female voters.
candidate_profile(df_profile_pres, 'winning_pres',
'POE, GRACE (IND)', 'lightblue', lowy=0.1, highy=0.2,
fignum=24)
Grace Poe, who ran as an independent, appealed to around 19% of the 65-above voters and 15% of the 17-19 age group. She also captured around 17-18% of the male and female votes.
In this section, we explore the characteristics of the cities/municipalities in which Leni Robredo and Bongbong Marcos won.
df_profile_vp = pd.merge(df_profile, vp_winners.reset_index(), how='left')
df_profile_vp = df_profile_vp.dropna()
candidate_profile(df_profile_vp, 'winning_vp',
'ROBREDO, LENI DAANG MATUWID (LP)',
'gold', lowy=0.3, highy=0.5, fignum=27)
Leni Robredo was able to capture around 40-46% of the votes from all age groups. She was popular among younger and older Filipinos, and less popular (but still over 40%) among the middle-aged groups. Around 46% of the voters in the 17-19 age group chose her as their vice president, and around 45% of 65-above voters voted for her. More than 40% of single, married, and widowed voters voted for Robredo, with the highest share among widows. She also captured a large portion of both male and female voters, at 42-43%.
candidate_profile(df_profile_vp, 'winning_vp',
'MARCOS, BONGBONG (IND)',
'maroon', lowy=0.3, highy=0.5, fignum=30)
Bongbong Marcos also captured a large number of votes across all age groups, similar to Leni Robredo. However, unlike Robredo, he was more popular among middle-aged voters and less popular among younger and older voters (but still more than 42%). Another difference is that among the legally separated voters, only around 36% voted for Robredo while more than 52% voted for Marcos. Among the male and female voters, more than 45% voted for Marcos.
In this plot, we check how many parties are represented in the National Elections and how many candidates are running under each party.
partylists_df = pd.merge(res_pres, res_vp, left_index=True, right_index=True)
partylists_df = pd.merge(partylists_df, res_senator,
left_index=True, right_index=True)
parties = [re.findall(r'.*\((.*)\).*', partylists_df.columns[i])[0]
for i in range(len(partylists_df.columns))]
parties = pd.Series(parties).value_counts()
plt.figure(figsize=(15, 5))
plt.bar(x=parties.index, height=parties.values, color='purple')
plt.title('Figure 33: Partylist Distribution of 2016 Election Candidates\n',
fontsize=14)
plt.show()
As shown in the figure above, the largest share of candidates, around 28%, chose to run as independents. Around 10% of the candidates are part of LP, with UNA a close second at around 8%. It is also worth noting that only around 2% of candidates are members of PDP Laban.
To model the underlying dynamics between voter demographics and election results, clustering analysis was undertaken to identify the characteristics of voters that prefer a particular candidate. The clustering used the voter demographics merged with the respective election results per City/Municipality. Columns that do not have much variance, such as the number of indigenous_people and the vote results of SEÑERES, ROY (WPPPMM), were not included. Thirty features were used for the clustering. To be able to visualize the results of the clustering, we first perform dimensionality reduction.
# merge voter profile with election results
profile_votes = pd.merge(df_profile, res_pres.reset_index(),
on=['Region', 'District', 'City/Municipality'],
how='inner')
profile_votes = pd.merge(profile_votes, res_vp.reset_index(),
on=['Region', 'District', 'City/Municipality'],
how='inner')
profile_votes = profile_votes.set_index(['Region', 'District',
'City/Municipality'])
profile_votes = profile_votes.drop(['registered_voter', 'indigenous_people',
'SEÑERES, ROY (WPPPMM)'], axis=1)
Because the features of the dataset are on different scales, the data must first be normalized. The StandardScaler() function of sklearn.preprocessing was applied before performing principal component analysis (PCA) on the dataset. As we want to be able to visualize the data, we only take the first two principal components of the transformed dataset and then visualize them.
The first two principal components already account for 82.13% of the variation in the entire dataset.
scaler = StandardScaler()
df_scaled = scaler.fit_transform(profile_votes.to_numpy())
pca = PCA(n_components=2, random_state=100)
df_new = pca.fit_transform(df_scaled)
plt.figure(figsize=(15, 6))
plt.title('Figure 34: Scatter Plot of Transformed Dataset\n', fontsize=14)
plt.scatter(df_new[:, 0], df_new[:, 1], color='green')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
print('Cumulative Variance Explained: ',
np.cumsum(pca.explained_variance_ratio_)[1])
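The scaling step above matters because PCA is driven by variance, so a column measured in thousands of voters would otherwise dominate a column measured as a rate. The sketch below shows the z-score transformation on two toy columns using only the standard library; the helper name `standardize` and the sample values are hypothetical, and the population standard deviation is used on the assumption that it matches the default behavior of StandardScaler.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a column to zero mean and unit (population) variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]


# Two toy features on very different scales (hypothetical values)
registered = [1000, 3000, 5000]
literacy = [0.90, 0.95, 1.00]

z_registered = standardize(registered)
z_literacy = standardize(literacy)
print(z_registered)
print(z_literacy)
```

After scaling, both columns carry the same spread, so neither dominates the principal components purely because of its units.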
After being able to transform the data, we can now perform clustering.
Recognizing that we are working with a high dimensional dataset, using the traditional clustering algorithms taught in Data Mining and Wrangling (DMW) would take a significant amount of time to generate results. The team therefore used Balanced Iterative Reducing and Clustering using Hierarchies (more popularly known as BIRCH). This algorithm is a scalable version of the traditional clustering algorithms taught in DMW: it first generates a small, compact summary of the large dataset that retains as much information as possible, and then performs the clustering on that summary. This makes it possible to scale K-means clustering or Agglomerative clustering.
For this implementation of BIRCH, we use K-means clustering as the final clustering step. Because of this, the only hyperparameter that we need to tune is the number of clusters (unlike in the Agglomerative clustering version, in which we tune two hyperparameters: the branching factor and the threshold). We check the number of clusters from 2 to 11 and then use both the Calinski-Harabasz score and the Silhouette coefficient to determine the optimal value of k.
The main advantage of BIRCH is its online learning feature, which allows us to fit the model iteratively by chunking the data. This is done by applying .partial_fit() to each partition. The entire dataset is first split into ten parts before fitting it to the model. In general, the results of BIRCH tend to closely approximate the results of either K-means clustering or Agglomerative clustering.
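To illustrate what the Calinski-Harabasz score rewards, the sketch below computes it by hand as the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom (k - 1 and n - k). This is a dependency-free illustration on toy points, not the sklearn implementation used in this study; the function name and data are hypothetical.

```python
def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, scaled by d.o.f."""
    n = len(points)
    dims = range(len(points[0]))
    overall = [sum(p[d] for p in points) / n for d in dims]
    # group the points by their cluster label
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    between = within = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in dims]
        between += len(members) * sum((centroid[d] - overall[d]) ** 2
                                      for d in dims)
        within += sum((p[d] - centroid[d]) ** 2
                      for p in members for d in dims)
    return (between / (k - 1)) / (within / (n - k))


# Two tight groups that are far apart along the first axis
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = calinski_harabasz(points, [0, 0, 1, 1])  # labels follow the groups
bad = calinski_harabasz(points, [0, 1, 0, 1])   # labels cut across them
print(good, bad)
```

A labeling that follows the natural groups yields a much higher score than one that cuts across them, which is why the highest score is taken as the preferred number of clusters.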
We now implement BIRCH in the subsequent code. Note that we would be performing the clustering on the original dataset. The transformed dataset would only be used for visualization purposes.
# implement BIRCH
# takes 2 mins and 30 seconds to run
birch_results = {}
for k in tqdm(list(range(2, 12))):
    # K-means serves as the global clustering step of BIRCH
    kmeans_birch = KMeans(k, random_state=143)
    birch = Birch(n_clusters=kmeans_birch)
    # feed the data in ten chunks to use the online learning feature
    for sample in np.array_split(profile_votes.to_numpy(), 10):
        birch.partial_fit(sample)
    # calling partial_fit without data runs only the global clustering step
    birch.partial_fit()
    labels = birch.predict(profile_votes.to_numpy())
    clusters, counts = np.unique(labels, return_counts=True)
    birch_results[str(k)] = {'predictions': labels,
                             'ch_score':
                             calinski_harabasz_score(profile_votes.to_numpy(),
                                                     labels),
                             'sc_score':
                             silhouette_score(profile_votes.to_numpy(),
                                              labels),
                             'num_clusters': len(clusters),
                             'clusters': clusters,
                             'counts': counts}
Visualized below are the resulting Calinski-Harabasz scores and Silhouette coefficient for each number of clusters k. From the plots, it can be seen that the highest Calinski-Harabasz score is achieved when the number of clusters is set to 8. Likewise, the silhouette coefficient is closest to 0.5 when the number of clusters is set to 8.
# convert results to a dataframe and plot the result
birch_results = pd.DataFrame.from_dict(birch_results).T
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 7))
ax = ax.flatten()
ax[0].plot(birch_results['num_clusters'], birch_results['ch_score'],
marker='.', markersize=20, color='purple')
ax[0].set_title('Figure 35: '
'Calinski-Harabasz Score for BIRCH Clustering\n', fontsize=15)
ax[0].set_xlabel('Number of Clusters', fontsize=12)
ax[0].spines['top'].set_visible(False)
ax[0].spines['right'].set_visible(False)
ax[1].plot(birch_results['num_clusters'], birch_results['sc_score'],
marker='.', markersize=20, color='green')
ax[1].set_title('Figure 36: '
'Silhouette Coefficient for BIRCH Clustering\n', fontsize=15)
ax[1].set_xlabel('Number of Clusters', fontsize=12)
ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
We visualize below the resulting clusters using the transformed data.
df_new_with_labels = np.append(df_new,
np.expand_dims(
birch_results.loc['8', 'predictions'],
axis=1), axis=1)
colors = ['green', 'red', 'blue', 'pink',
'purple','gold', 'fuchsia', 'brown']
plt.figure(figsize=(15, 6))
for klass, color in zip(range(0, 8), colors):
Xk = df_new_with_labels[df_new_with_labels[:, 2] == klass]
plt.scatter(Xk[:, 0], Xk[:, 1], c=color, label=klass)
plt.legend()
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Figure 37: BIRCH Clustering Results\n', fontsize=14)
plt.show()
Looking at the results, it can be seen that the clusters are distinctly separated from each other. However, there are just a few overlaps between the pink cluster (cluster 3) and the purple cluster (cluster 4). It can also be seen from the data that highly separated points (which may be considered as outliers) tend to form one cluster. Moreover, the visualization above would show that there is a high degree of imbalance among the different clusters. The purple cluster tends to have three elements, the brown cluster tends to only have four, and the red cluster tends to have only six.
We can further explore the resulting clusters by calculating their respective median values for all the variables used, and then identify the themes of each cluster by finding their respective maximum.
# display maximum median age, civil status, gender, president, and
# vice president
profile_votes['birch_predict'] = birch_results.loc['8', 'predictions']
cluster_medians = (profile_votes.groupby('birch_predict')
[profile_votes.columns[:-1]].median())
age_groups = ['17-19', '20-24', '25-29', '30-34', '35-39', '40-44',
'45-49', '50-54', '55-59', '60-64', '65-above']
status_groups = ['single', 'married', 'widow', 'legally_seperated']
gender_groups = ['male', 'female']
presidents = ['BINAY, JOJO (UNA)', 'DEFENSOR SANTIAGO, MIRIAM (PRP)',
'DUTERTE, RODY (PDPLBN)', 'POE, GRACE (IND)',
'ROXAS, MAR DAANG MATUWID (LP)']
vice_presidents = ['CAYETANO, ALAN PETER (IND)', 'ESCUDERO, CHIZ (IND)',
'HONASAN, GRINGO (UNA)', 'MARCOS, BONGBONG (IND)',
'ROBREDO, LENI DAANG MATUWID (LP)',
'TRILLANES, ANTONIO IV (IND)']
cluster_res = pd.DataFrame()
groups = [age_groups, status_groups, gender_groups,
presidents, vice_presidents]
for group in groups:
    if group is presidents or group is vice_presidents:
        new_res = pd.DataFrame(cluster_medians[group]).idxmax(axis=1)
    else:
        new_res = pd.DataFrame((cluster_medians[group] /
                                cluster_medians[group]
                                .sum(axis=0)) * 100).idxmax(axis=1)
    cluster_res = pd.concat([cluster_res, new_res], axis=1)
cluster_res.columns = ['dominant_age_group', 'dominant_civil_status',
'dominant_gender', 'winning_pres', 'winning_vp']
cluster_res = cluster_res.reset_index().rename(columns={'index':
'cluster_number'})
cluster_res['num_cities'] = birch_results.loc['8', 'counts']
cluster_res
From the table above, it can be seen that all the clusters formed tend to have voted for President Rodrigo Duterte. This shows how influential he was to people from different walks of life, as he was able to capture the votes of a wide array of demographics (whether young or old, single or married, male or female).
What is more interesting, though, are the accompanying vice-president winners for each cluster. Cluster 0 and Cluster 2 both consist of people with an older, male-dominated voting profile. Both of these clusters tend to prefer Rodrigo Duterte for president and Leni Robredo for vice president. This shows that although Duterte and Robredo came from different parties and have different political ideologies, a good number of Filipino voters preferred voting for them together as president and vice president. Likewise, there is only one cluster in which the winning president and vice president came from the same party (Duterte and Alan Peter Cayetano, as seen in the fourth cluster). This hints that Filipinos, in general, do not necessarily vote for candidates who run together under one party.
To validate this insight further, the team has decided to implement Frequent Itemset Mining (FIM) on the data. This data mining algorithm was born out of market-basket analysis, which aims to determine the items in the supermarket that are often purchased together by customers. In the context of this study, FIM would be used to be able to identify which among the political candidates are voted together in each geographic location.
To perform Frequent Itemset Mining, the Python library pyFIM will be used. The FP-Growth algorithm was chosen for this study because it typically runs faster than candidate-generation algorithms such as Apriori, which searches the itemset space breadth-first. FP-Growth, like other pattern-growth algorithms, compresses the transactions into a prefix tree (the FP-tree) and mines that tree directly, which avoids generating candidate itemsets that do not occur in the database at all.
To implement the algorithm, the election results for the presidential, vice-presidential, and senatorial positions must be transformed into a user-item matrix. This format is a pandas DataFrame where the indices correspond to the cities/municipalities and the columns correspond to the candidates for all positions being studied. The values are binary: 1 if the candidate won in that area and 0 otherwise.
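As a rough, dependency-free sketch of this transformation, the snippet below turns a mapping of areas to winning candidates into a binary matrix. The helper `to_binary_matrix`, the area names, and the winners are hypothetical and only illustrate the layout; the notebook itself works with pandas DataFrames.

```python
def to_binary_matrix(winners_by_area):
    """Map {area: [winning candidates]} to a binary area-by-candidate matrix.

    Returns the nested dict {area: {candidate: 0 or 1}} and the column order.
    """
    columns = sorted({c for winners in winners_by_area.values()
                      for c in winners})
    matrix = {
        area: {c: int(c in winners) for c in columns}
        for area, winners in winners_by_area.items()
    }
    return matrix, columns


# hypothetical areas and winners (illustrative only)
winners_by_area = {
    "Area A": ["DUTERTE", "ROBREDO"],
    "Area B": ["ROXAS", "ROBREDO"],
}
matrix, columns = to_binary_matrix(winners_by_area)
print(columns)           # ['DUTERTE', 'ROBREDO', 'ROXAS']
print(matrix["Area A"])  # {'DUTERTE': 1, 'ROBREDO': 1, 'ROXAS': 0}
```

If pandas is available, `pd.DataFrame.from_dict(matrix, orient="index")` would give the same table with areas as the index and candidates as the columns.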
In order to provide more robust insights, Association Rule Mining was performed as an extension of the Frequent Itemset Mining results. In Association Rule Mining, association rules are generated that describe the relationship between items and are composed of an antecedent (an item or set of items found in the data) and the consequent (an item present in combination with the antecedent). The consequent serves as a result, inference, or natural effect while the antecedent refers to the event that tends to cause (or is attributed to) the consequent.
The confidence and lift of these rules can then be calculated to determine the most relevant relationships. Confidence is the conditional probability that the consequent is in a set of items given that the set contains the antecedent. Lift is the ratio of that confidence to the overall probability of the consequent, so it measures how much more (or less) likely the consequent becomes once the antecedent is present.
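These two measures can be written out directly from their definitions. The sketch below is a plain-Python illustration on made-up transactions, not the pyFIM implementation; the function names and the toy data are assumptions for the example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by the baseline support of the consequent."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# hypothetical winners in four localities (illustrative only)
toy = [
    ("DUTERTE", "ROBREDO"),
    ("DUTERTE", "ROBREDO"),
    ("DUTERTE", "CAYETANO"),
    ("ROXAS", "ROBREDO"),
]
conf_value = confidence(toy, ["DUTERTE"], ["ROBREDO"])  # (2/4) / (3/4) = 2/3
lift_value = lift(toy, ["DUTERTE"], ["ROBREDO"])        # (2/3) / (3/4) = 8/9
```

In this toy data the lift is below 1 because Robredo already appears in three of the four transactions, so knowing that Duterte won does not make her more likely than the baseline.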
To validate the results from the clustering analysis, we first implement FIM on the 2016 presidential and vice-presidential results (which are aggregated on the city/municipality level). We set a minimum support of 90 cities or municipalities in order to consider a president-vice president pair as a frequent itemset. Likewise, we only take the 2-item itemsets as we are interested to know which candidates are often voted together.
db = list(winner_map[['winning_pres', 'winning_vp']].to_records(index=False))
result = sorted(fim.fpgrowth(db, target='s', supp=-90, zmin=2),
                key=lambda x: (-x[1]))
result = pd.DataFrame(result, columns=['itemset', 'support'])[:7]
pres_freq_itemsets = list(result['itemset'])
pres_freq_itemsets = [[x, y] for x, y in pres_freq_itemsets]
pres_freq_itemsets[-1] = ['DUTERTE, RODY (PDPLBN)',
                          'CAYETANO, ALAN PETER (IND)']
pres_freq_itemsets[2] = ['DUTERTE, RODY (PDPLBN)',
                         'MARCOS, BONGBONG (IND)']
def itemset_labeler(x):
    """Consolidate infrequent itemsets as other itemsets."""
    if x not in pres_freq_itemsets:
        return 'OTHER PRESIDENT-VP PAIR'
    else:
        return tuple(x)
winner_map['itemset'] = (winner_map['winning_pres'] + '<>' +
                         winner_map['winning_vp'])
winner_map['itemset'] = winner_map['itemset'].str.split('<>')
winner_map['itemset'] = winner_map['itemset'].apply(itemset_labeler)
pd.DataFrame(winner_map['itemset'].value_counts()).rename(columns={'itemset':
                                                                   'support'})
From the table above, it can be seen that Mar Roxas and Leni Robredo are voted together the most, as this pair is observed in almost 400 cities/municipalities. This is expected given that both of them come from the same party. However, what is most interesting is that the Rodrigo Duterte-Leni Robredo combination was observed in almost 20% of the cities/municipalities in the Philippines. This is noticeably greater than the number of cities/municipalities won by the Rodrigo Duterte-Alan Peter Cayetano tandem, even though Cayetano was the current president's running mate during the 2016 National Elections. This confirms the results of our clustering model regarding the existence of a Duterte-Robredo tandem in terms of voter preference.
Likewise, the table above shows that Bongbong Marcos was often the vice-presidential choice in areas won by Duterte, Grace Poe, or Jejomar Binay.
We visualize the locations in which these frequent itemsets won below.
cmap_itemsets = ListedColormap(['darkorange', 'red', 'maroon', 'royalblue',
'black', 'green', 'fuchsia', 'gold'])
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 15))
winner_map.plot(ax=ax, cmap=cmap_itemsets,
column='itemset', legend=True,
categorical=True, legend_kwds={'loc': 'upper right',
'bbox_to_anchor': (1.5, 1)})
ax.set_title('Figure 38: Winning President-VP Candidate'
' per City/Municipality\n', fontsize=14)
ax.axis('off')
plt.show()
We can gain a deeper understanding of how voters select their candidates for president and vice president by visualizing the winning candidates per city/municipality. We can see that the Binay-Marcos tandem won in cities in northern Luzon. Central Luzon also chose Bongbong Marcos as vice president, although there he was paired with Grace Poe. On the other hand, cities in Southern Luzon primarily chose the tandem of Grace Poe and Leni Robredo. In most Visayan cities, the Roxas-Robredo tandem won. In the northern parts of Mindanao, cities chose Robredo as vice president, but paired with Duterte as president. The eastern parts of Mindanao chose Duterte as president along with his endorsed vice-presidential candidate, Cayetano.
From the plot above, the effects of geopolitics are highly evident as areas that are close to each other tend to have similar president and vice-presidential candidate preferences.
To make sense of the strength of the relationships among the different frequent itemsets, we perform Association Rule Mining. By conducting this analysis, we are able to identify which political candidate contributed most to the votes of another candidate.
The results of the association rule mining are shown below. We set the minimum confidence to 20% for this implementation. Note that each labeled pair corresponds to the (antecedent, consequent) sequence.
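# NOTE: pyFIM documents `conf` as a percentage (default 80), so conf=0.2 is a
# 0.2% threshold; a 20% minimum confidence would be written as conf=20
result = sorted(fim.fpgrowth(db, target='r',
                             zmin=2, report='C', conf=0.2),
                key=lambda x: -x[2])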
conf_result = (pd.DataFrame(result, columns=['consequent',
'antecedent', 'confidence'])
.sort_values(by='confidence', ascending=False))
conf_result['antecedent'] = conf_result['antecedent'].apply(lambda x:
list(x)[0])
conf_result['direction'] = (conf_result['antecedent'] + ',\n' +
conf_result['consequent'])
plt.figure(figsize=(20,8))
plt.barh(y=conf_result['direction'][:4], width=conf_result['confidence'][:4],
color='purple', alpha=0.5)
plt.barh(y=conf_result['direction'][4:11],
width=conf_result['confidence'][4:11],
color='mediumturquoise', alpha=0.5)
plt.title('Figure 39: Association Rule Mining for '
'Presidential-VP Candidates\n', fontsize=16)
plt.xlabel('Confidence\n(probability of voting for candidate Y '
'if candidate X is voted)', fontsize=14)
plt.ylabel('Antecedent-Consequent Pair: Candidate (X, Y)', fontsize=14)
plt.show()
Performing Association Rule Mining and inspecting the itemsets with the most confidence, we can see that the pair with the highest confidence is the Roxas-Robredo itemset, with Roxas as the antecedent and Robredo as the consequent. This shows that when voters choose Roxas as their president, there is a high probability (91%) that they will choose Robredo as their vice president. However, it is interesting to note that when the pairing is switched and Robredo is the antecedent and Roxas is the consequent, the probability decreases to 45%. This would mean that when voters chose Robredo as their vice president, it was not as likely that they would choose Roxas as their president.
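The asymmetry between the two directions follows from the definition of confidence: both rules share the same pair count, but each divides by a different antecedent count. The short sketch below uses made-up round numbers, chosen only so that the arithmetic lands near the reported 91% and 45%; they are not the actual counts from the dataset.

```python
def rule_confidence(pair_count, antecedent_count):
    """Confidence of antecedent -> consequent from raw locality counts."""
    return pair_count / antecedent_count


# hypothetical counts of cities/municipalities (not taken from the data)
roxas_and_robredo = 400  # areas where both Roxas and Robredo won
roxas_total = 440        # areas where Roxas won, with any vice president
robredo_total = 890      # areas where Robredo won, with any president

roxas_to_robredo = rule_confidence(roxas_and_robredo, roxas_total)
robredo_to_roxas = rule_confidence(roxas_and_robredo, robredo_total)
print(round(roxas_to_robredo, 2), round(robredo_to_roxas, 2))  # 0.91 0.45
```

Under these assumed counts, Robredo won in far more areas than Roxas did, so the shared areas make up most of Roxas's wins but less than half of hers.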
Additionally, when Duterte is the antecedent and Robredo is the consequent, the confidence is around 50%, meaning that there is roughly an even chance that voters who chose Duterte would choose Robredo as their vice president.
We can also see that choosing Marcos as vice president has a high probability of around 85% when Binay is chosen as a voter's president, and this is the second highest confidence observed. It shows as well that there is around 55% probability that a voter would choose Marcos as a vice president when they have chosen Poe as their president.
For the subsequent implementations of FIM, we now include the results of the 2016 Senatorial Elections to the dataset. With this, each city/municipality would now have (1 President, 1 Vice President, 12 Senators) -- corresponding to the most voted candidates for that respective geographic area. Hence, each record would have 14 candidates.
Before performing FIM, we first perform some quick exploratory analysis on the senatorial elections results. Seen below are the Top 1 senatorial candidate per city/municipality and the number of cities/municipalities in which a senatorial candidate won.
# determine winners of the senatorial race per City/Municipality
pres_vp_winners = pd.merge(pres_winners, vp_winners,
left_index=True, right_index=True)
nlargest = 12
order = np.argsort(-res_senator.values, axis=1)[:, :nlargest]
senator_winners = pd.DataFrame(res_senator.columns[order],
columns=[f'top {i} senator'
for i in range(1, nlargest+1)],
index=res_senator.index)
national_winners = pd.merge(pres_vp_winners, senator_winners,
left_index=True, right_index=True)
senator_map = (pd.merge(turnout_map, senator_winners.reset_index()
.drop('Region', axis=1),
on=['District', 'City/Municipality'], how='left'))
top_senators = national_winners['top 1 senator'].value_counts()[:5]
def senator_itemset_labeler(x):
    """Consolidate infrequent itemsets as other itemsets."""
    if x not in top_senators.index:
        return 'OTHER SENATOR'
    else:
        return x
senator_map['label'] = (senator_map['top 1 senator']
.apply(senator_itemset_labeler))
cities_won = (senator_winners.apply(pd.value_counts).sum(axis=1)
.sort_values(ascending=False))
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 15))
ax = ax.flatten()
cmap_itemsets = ListedColormap(['gold', 'black', 'royalblue',
'red', 'green', 'violet'])
senator_map.plot(ax=ax[0], cmap=cmap_itemsets,
column='label', legend=True,
categorical=True, legend_kwds={'loc': 'upper right',
'bbox_to_anchor': (1.1, 1)})
ax[0].set_title('Figure 40: Top 1 Senatorial Candidate'
' per City/Municipality\n', fontsize=16)
ax[0].axis('off')
ax[1].bar(x=cities_won.index[:12], height=cities_won.values[:12],
color='purple')
ax[1].bar(x=cities_won.index[12:20], height=cities_won.values[12:20],
color='gray')
ax[1].axvline(11.5, linestyle='--', color='green')
ax[1].set_title('Figure 41: Number of Cities/Municipalities where Senatorial '
'Candidate Won\n', fontsize=16)
ax[1].tick_params(axis='x', labelrotation=90)
ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
For the senatorial data, we can see that Frank Drilon of the Liberal Party won in the largest number of cities. He was popular in the Western Visayas region, with scattered cities across Luzon and Mindanao also voting for him. Other strong senatorial candidates were Manny Pacquiao and Vicente Sotto, who won in cities across Mindanao and Luzon respectively.
It must also be noted that the 12 senatorial candidates that won the largest number of cities/municipalities in the chart above were also the ones that won seats in the 2016 senatorial elections.
Recognizing that most of the candidates in the 2016 National Elections ran either as independents or under the Liberal Party (LP), the party of then-incumbent President Benigno Aquino, we identify the locations in which independent candidates and Liberal Party candidates won.
party_winners = national_winners.copy()
for col in party_winners.columns:
    # pull the party code out of the "(...)" suffix of each candidate label
    party_winners[col] = party_winners[col].str.extract(r'.*\((.*)\).*',
                                                        expand=False)
party_winners = party_winners.apply(pd.value_counts, axis=1).fillna(0)
party_map = pd.merge(turnout_map,
party_winners.reset_index().drop('Region', axis=1),
on=['District', 'City/Municipality'], how='left')
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 15))
ax = ax.flatten()
party_map.plot(ax=ax[0], column='LP', legend=True, cmap='cividis',
classification_kwds=dict(bins=[1, 2, 3, 4, 5, 6, 7]), k=7)
ax[0].set_title('Figure 42: Number of Liberal Party Candidates that Won \n'
                'in the 2016 National Elections per City/Municipality\n\n',
                fontsize=14)
ax[0].axis('off')
party_map.plot(ax=ax[1], column='IND', legend=True, cmap='copper',
               classification_kwds=dict(bins=[1, 2, 3, 4, 5, 6, 7]), k=7)
ax[1].set_title('Figure 43: Number of Independent Candidates that Won \n'
'in the 2016 National Elections per City/Municipality\n\n',
fontsize=14)
ax[1].axis('off')
plt.show()
Drilling down on the Liberal Party, we can see a strong preference for this party among the Visayas cities and some Mindanao cities. The preference is weaker in southern Mindanao and some parts of Luzon. Independent candidates also showed good results in Mindanao and Luzon.
We now perform FIM on the national elections dataset that included the winners of the senatorial elections per city/municipality. Any frequent itemset must have a minimum support of 1100 or approximately two-thirds of all the cities/municipalities in the Philippines. For this analysis, we also set the frequent itemsets to have a minimum size of four candidates. Through this analysis, we can check whether Filipinos generally vote straight from a single party.
# none of the most frequent itemsets for the senatorial elections generated
# an itemset that is purely dominated by LP
db_results = list(national_winners.to_records(index=False))
results = sorted(fim.fpgrowth(db_results, target='s',
supp=-1100, zmin=4),
key=lambda x: (-x[1]))
results = pd.DataFrame(results, columns=['itemset', 'support'])
results
From the table above, we can see that each of the 22 frequent itemsets of at least four candidates includes at least one politician who is not part of the Liberal Party. Candidates such as Manny Pacquiao, Ping Lacson, Risa Hontiveros (in 2016, she was not part of LP yet), and Dick Gordon tend to be part of these frequent itemsets. Even though there were eight senators that ran under the Liberal Party, the results above show that no more than three of them are voted together in most cities/municipalities. With this, we arrive at the conclusion that Filipino voters generally do not vote straight in the elections. This suggests that Filipino voters tend to prefer some diversity in the political philosophies of the candidates that they choose.
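The check described above, whether a frequent itemset is made up of candidates from a single party, can be sketched as follows. It assumes that every candidate label ends with the party code in parentheses, as in the labels used elsewhere in this notebook; the helper names and the placeholder label `CANDIDATE A (LP)` are hypothetical, and treating `IND` as "no party" is a choice made for this example.

```python
import re

# party code assumed to be the last parenthesised token of a label
PARTY_PATTERN = re.compile(r"\(([^()]*)\)\s*$")


def party_of(label):
    """Return the party code at the end of a candidate label, or None."""
    match = PARTY_PATTERN.search(label)
    return match.group(1) if match else None


def is_single_party(itemset):
    """True if every candidate in the itemset carries the same party code.

    Independents ('IND') and unlabeled candidates never count as a party.
    """
    parties = {party_of(label) for label in itemset}
    return len(parties) == 1 and parties.isdisjoint({None, "IND"})


straight = ["ROBREDO, LENI DAANG MATUWID (LP)", "CANDIDATE A (LP)"]
mixed = ["ROBREDO, LENI DAANG MATUWID (LP)", "DUTERTE, RODY (PDPLBN)"]
print(is_single_party(straight), is_single_party(mixed))  # True False
```

Applying `is_single_party` to each row of the frequent itemset table would flag any straight-ticket itemsets directly instead of relying on visual inspection.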
To identify which politicians tend to contribute to the votes of another politician, we perform association rule mining on the resulting frequent itemsets in order to gauge the dynamics between the candidates. To evaluate the rules, we use the lift. A lift greater than 1 implies that a vote for the antecedent candidate is positively associated with a vote for the consequent candidate, a lift close to 1 implies that the two are roughly independent, and a lift below 1 (approaching 0) indicates a negative association between the two candidates. A consequent candidate may benefit from forming an alliance with an antecedent candidate whose lift is greater than 1, as this partnership can bring in more votes for the consequent candidate.
For a better appreciation of the results, we made an interactive dashboard in order to visualize the lifts for all candidates that ran for a national position in 2016. A minimum support of 20 was established to determine frequent itemsets.
# perform frequent itemset mining to identify candidates who are associated
# with each other
fim_res = sorted(fim.fpgrowth(db_results, target='r',
supp=-20, report='l', zmin=2, zmax=2, conf=0),
key=lambda x: (-x[2]))
results = pd.DataFrame(fim_res, columns=['consequent', 'antecedent',
'lift'])
results['antecedent'] = results['antecedent'].apply(lambda x: list(x)[0])
# get filtered associations for each candidate
dfs_dict = {}
visibles = []
for i, candidate in enumerate(results['consequent'].unique()):
    lst = [0] * len(results['consequent'].unique())
    dfs_dict[candidate] = results[results['consequent'] == candidate]
    lst[i] = 1
    visibles.append(list(map(bool, lst)))
# visualize interactive plot to identify top 15 other politicians
# (antecedent) that are voted along with the consequent
init_title = {'title': 'Lift of Top 15 Associated Candidates to '
'ROBREDO, LENI DAANG MATUWID (LP)'}
fig = go.Figure(layout=init_title)
# create plots for every consequent
for candidate in dfs_dict.keys():
    # let Leni Robredo be the default value
    if candidate == 'ROBREDO, LENI DAANG MATUWID (LP)':
        fig.add_trace(go.Bar(x=dfs_dict[candidate]['antecedent'].values[:15],
                             y=dfs_dict[candidate]['lift'].values[:15],
                             marker_color='purple', visible=True))
    else:
        fig.add_trace(go.Bar(x=dfs_dict[candidate]['antecedent'].values[:15],
                             y=dfs_dict[candidate]['lift'].values[:15],
                             marker_color='purple', visible=False))
buttons = []
# create buttons for the dropdown menu
for i, candidate in enumerate(dfs_dict.keys()):
    buttons.append(dict(method='update',
                        label=candidate,
                        args=[{"visible": visibles[i]},
                              {"title":
                               f'Lift of Top 15 Associated Candidates'
                               f' to {candidate}'}])
                   )
fig.add_shape(type='line',
x0=-1,
y0=1,
x1=15,
y1=1,
line=dict(color='green', dash='dot'),
xref='x',
yref='y'
)
# update the layout and render the interactive plot on the HTML file
updatemenu = []
main_menu = dict()
updatemenu.append(main_menu)
updatemenu[0]['buttons'] = buttons
updatemenu[0]['direction'] = 'down'
updatemenu[0]['showactive'] = False
fig.update_layout(updatemenus=updatemenu)
fig.update_layout(
autosize=False,
width=1000,
height=800)
fig.update_xaxes(
tickangle=90,
title_text="Antecedent",
title_font={"size": 20},
title_standoff=25)
fig.show(renderer='notebook')
The interactive plot above shows the lift scores of the frequent itemsets of candidates. Vice President Leni Robredo, even though she was part of the Liberal Party, was still able to amass votes from supporters of non-LP candidates like Neri Colmenares, Mark Lapid, Sergio Osmena, and Francis Tolentino, as seen from lift values greater than 1. In contrast, her running mate Mar Roxas only had lifts greater than 1 for Liberal Party candidates. This suggests that Mar Roxas was not able to convince Filipino voters other than Liberal Party supporters to vote for him, which contributed to his loss in the 2016 Presidential Elections.
Selecting Leni Robredo as the consequent, we can see that she has a high association with LP candidates such as Mar Roxas (lift of around 1.7), Ina Ambolodto (lift of 1.6) and TG Guingona (lift of around 1.45). However, candidates outside her party also show high lift, such as Aldin Ali from the WPPPMM party and Shariff Albani, who ran as an independent. Generally, the associations for Robredo were from varied parties.
Selecting Rodrigo Duterte as the consequent also shows interesting associations. He displays much higher lift scores, reaching almost 2.5 for his association with Cayetano. However, Duterte also shows high association with candidates from a wide range of parties, including candidates who ran as independents.
Lastly, selecting Bongbong Marcos as the consequent shows that he is highly associated with choosing Binay as president, with a lift score of more than 2. He also shows high association with Samuel Pagdilao, who was associated with the PNP and ran as an independent. Marcos also showed a high association with Grace Poe, with a lift score of almost 1.5. This behavior is supported by the results in Figure 16 and Figure 17, which showed that the areas that voted for Binay or Poe as their president tended to vote for Marcos as their vice president. Figure 38 further supports this by showing that the cities that voted for the Binay-Marcos and Poe-Marcos tandems were part of the Binay and Poe bailiwicks.
The reader is encouraged to toggle among the different candidates available on the dashboard (especially the ones running in the 2022 National Elections) and draw their own inferences from them. From these results, political alliances among candidates can be identified.
Having learned that Filipino voters generally do not vote straight, we share some insights for the National Election candidates which may be helpful for their respective campaigns:
First, candidates for the presidency and vice-presidency must be receptive to any alliances that they can gain, especially those with politicians of different political philosophies. Candidates running for the two highest positions in the country must recognize the possibility that their running mate will not win the election. This was evident in both the 2010 and the 2016 National Elections, won by the Benigno Aquino-Jejomar Binay and the Rodrigo Duterte-Leni Robredo pairs respectively, each of which combined candidates from different political parties.
Second, the Rodrigo Duterte endorsement would be very crucial for the upcoming 2022 National Elections. Looking at the lift of the candidates closely associated with Manny Pacquiao, it can be seen that Rodrigo Duterte has a lift that is greater than 1. This suggests that the presidential campaign of Manny Pacquiao would benefit significantly from a Rodrigo Duterte endorsement. Moreover, as recent surveys still show relatively strong support for his administration, his blessing would be advantageous to the candidate who receives it. As of this writing, it is still unknown whom Rodrigo Duterte will support.
Based on the results of the group's analysis, Frequent Itemset Mining has revealed that voters in the 2016 Philippine elections generally did not vote straight. It was observed that the frequent itemsets for senatorial candidates were generally heterogeneous, wherein voters chose candidates that were not members of the same party. The same was observed for presidential and vice-presidential candidates, wherein the tandems Duterte-Robredo and Duterte-Marcos outnumbered President Duterte's own endorsed tandem of Duterte-Cayetano.
This behavior may be attributed to Filipino citizens' attempts at checks and balances, where voters may be averse to assembling a government that is dominated by a single party. It is possible that the fractional party list system present in the Philippines has led voters to ignore a candidate's membership in a party, because party lists increasingly revolve around political alliances rather than policies and ideologies.
Voters are open to electing politicians with different political alignments, and this can even be seen in recent news where the Albay 2nd District Rep. Joey Salceda endorsed a Leni Robredo and Sara Duterte tandem for the 2022 elections. Additionally, the higher participation of younger age groups in the 2016 elections may suggest an inclination towards more educated choices in candidates especially with their increased access to data in this age of information.
The group proposes the following recommendations for further study:
More granular data can be used to implement Frequent Itemset Mining, such as data on the ballot level or precinct level if available, in order to create a more accurate picture on voting behavior. This would help political scientists have a better idea on the political leanings of politicians along with possible coalitions or alliances that can exist.
Likewise, this methodology may also be used to further improve the analysis of surveys and opinion polls. Instead of just publishing percentages of results, more information on the association preferences between presidential, vice presidential, and senatorial candidates can be included using the methodology in this study. This would allow politicians that are running for government office to have a better idea of the characteristics of their voters, allowing them to create a better campaign strategy.
[1] The Dilawans, the DDS, the Dead Tired, and the Dedma. (2021, March 5). RAPPLER. https://www.rappler.com/voices/imho/opinion-dilawans-dds-dead-tired-dedma/
[2] Abinales, PN. (2016). The 2016 Philippine elections: local power as national authority. Asia Pacific Bulletin, 344. East-West Center. Retrieved from https://scholarspace.manoa.hawaii.edu/bitstream/10125/41060/apb%20no.344.pdf
[3] Alis, C. (2016). scrapER2016 [electronic resource: python source code]. Retrieved from https://github.com/ianalis/scraper2016
[4] Ansolabehere, S. & Hersh, E. (2011). Gender, race, age, and voting: a research note. APSA 2011 Meeting Paper. Retrieved from https://ssrn.com/abstract=1901654
[5] Arugay, AA. (2017). The Philippines in 2016: the electoral earthquake and its aftershocks. Southeast Asian Affairs, 277-296. doi:10.2307/26492610
[6] Curato, N. (2016). Politics of anxiety, politics of hope: penal populism and Duterte's rise to power. Journal of Current Southeast Asian Affairs, 35(3), 91-109.
[7] Fermin, A. (2001). Prospects and scenarios for the party list system in the Philippines. Friedrich Ebert Stiftung Online Papers. Retrieved from https://library.fes.de/pdf-files/bueros/philippinen/50074.pdf
[8] Holmes, RD. (2016). The dark side of electoralism: opinion polls and voting in the 2016 Philippine presidential election. Journal of Current Southeast Asian Affairs, 35(3), 15-38.
[9] Jallorina, LB. (2021). Party-list perversion in the Philippines: An analysis on the rise of spurious party-list groups and its effect on the current legal framework of the party-list system. Retrieved from https://animorepository.dlsu.edu.ph/etdm_law/1
[10] Landé, C. (1996). Post-Marcos Politics: A Geographical and Statistical Analysis of the 1992 Presidential Election. Singapore: Institute of Southeast Asian Studies.
[11] McDermott, ML. (1998). Race and gender in low-information elections. Political Research Quarterly, 51(4), 895-918. doi:10.2307/449110
[12] Teehankee, JC. (2020). Untangling the party list system. In Hutchcroft, PD. (Ed.), Strong patronage, weak parties, (pp. 151-167). World Scientific. doi:10.1142/9789811212604_0009
[13] Teehankee, JC. (2018). Regional dimensions of the 2016 general elections in the Philippines: Emerging contours of federalism. Regional and Federal Studies, 28(3), 383-394. doi:10.1080/13597566.2018.1454911
[14] Teehankee, JC. (2020). Factional Dynamics in Philippine Party Politics, 1900–2019. Journal of Current Southeast Asian Affairs, 39(1), 98-123. doi:10.1177/1868103420913404
[15] INTERACTIVE: How voter turnout has changed in 1,611 Philippine towns & cities since 2007. (2016, April 18). Thinking Machines. https://stories.thinkingmachin.es/election-turnout/
[16] Rappler.com. (2015, July 28). ‘Philandering’ Rodrigo Duterte cause of marriage annulment. Rappler. https://www.rappler.com/newsbreak/inside-track/philandering-rodrigo-duterte-marriage-annulment
[17] Schraufnagel, S., Buehler, M., & Lowry-Fritz, M. (2014). Voter Turnout in Democratizing Southeast Asia: A Comparative Analysis of Electoral Participation in Five Countries. Taiwan Journal of Democracy, 10(1). https://eprints.soas.ac.uk/18940/1/SchraufnagelBuehlerLowryToD2014.pdf
[18] David, C. C., & San Pascual, M. R. S. (2016). Predicting vote choice for celebrity and political dynasty candidates in Philippine national elections. Philippine Political Science Journal, 37(2), 82–93. https://doi.org/10.1080/01154451.2016.1198076
[19] Indigenous World 2020: Philippines. (2020, May 11). IWGIA - International Work Group for Indigenous Affairs. https://www.iwgia.org/en/philippines/3608-iw-2020-philippines.html
[20] Philippine Statistics Authority. (2013). Persons with Disability in the Philippines (Results from the 2010 Census) | Philippine Statistics Authority. https://psa.gov.ph/content/persons-disability-philippines-results-2010-census
[21] Senate of the Philippines. (2016). Press Release - Poe backed by Ilocos officials. http://legacy.senate.gov.ph/press_release/2016/0426_poe5.asp
